Preface



The first International Workshop on Interactive Distributed Multimedia Systems and 
Telecommunication Services (IDMS) was organized by Prof. K. Rothermel and Prof. 
W. Effelsberg, and took place in Stuttgart in 1992. It had the form of a national forum 
for discussion on multimedia issues related to communications. The succeeding event 
was "attached" as a workshop to the German Computer Science Conference (GI 
Jahrestagung) in 1994 in Hamburg, organized by Prof. W. Lamersdorf. The chairs of 
the third IDMS, E. Moeller and B. Butscher, enhanced the event to become a very 
successful international meeting in Berlin in March 1996. 

This short overview on the first three IDMS events is taken from the preface of the 
IDMS '97 proceedings (published by Springer as Lecture Notes in Computer Science, 
Volume 1309), written by Ralf Steinmetz and Lars Wolf. Ralf Steinmetz, as
general chair, and Lars Wolf, as program chair of IDMS '97, organized an excellent
international IDMS in Darmstadt.

Since 1998, IDMS has moved from Germany to other European cities to emphasize 
the international character it had gained in the previous years. IDMS '98 was 
organized in Oslo by Vera Goebel and Thomas Plagemann at UniK - Center for 
Technology at Kjeller, University of Oslo. Michel Diaz, Phillipe Owezarski, and 
Patrick Senac successfully organized the sixth IDMS event, again outside Germany. 
IDMS'99 took place in Toulouse at ENSICA. IDMS 2000 continued the tradition and 
was hosted in Enschede, the Netherlands. 

The goal of the IDMS series of workshops has been and still is to bring together re- 
searchers, developers, and practitioners from academia and industry; and to provide a 
forum for discussion, presentation, and exploration of technologies and advances in 
the broad field of interactive distributed multimedia systems and telecommunication 
services, ranging from basic system technologies such as networking and operating 
system support to all kinds of teleservices and distributed multimedia applications. To 
accomplish this goal IDMS remains relatively "small": it has no parallel sessions and 
a limited number of participants to encourage interaction and discussion. 

Although IDMS2000 had tough competition from other conferences and workshops, 
it received 60 submissions from Europe, Asia, Africa, and North and South America. 
Every paper was refereed by at least three reviewers. A tedious job, luckily made 
easier with help from the excellent online conference tool ConfMan, developed in 
Oslo for the 1998 IDMS workshop. Ultimately, the 26 members of the program 
committee (PC) and 33 referees selected 24 high quality papers for presentation at the 
workshop. The main topics of IDMS2000 are: efficient audio/video coding and 
delivery, multimedia conferencing, synchronization and multicast, communication, 
control and telephony over IP networks, QoS models and architectures, multimedia 
applications and user aspects, design and implementation approaches, and mobile 
multimedia and ubiquitous computing systems. This technical program is 
complemented with three invited papers: "Energy-efficient hand-held multimedia 
systems" by Gerard Smit, "Short-range connectivity with Bluetooth" by Jaap Haartsen 
and "On the failure of middleware to support multimedia applications" by Gordon 
Blair. 
Abstract. The trend in wireless terminals has been to shrink a general-purpose
desktop PC into a package that can be conveniently carried. Even PDAs have
not ventured far from the general-purpose model, neither architecturally nor in
terms of usage model. Both the notebook and the personal computer generally
use the same standard PC operating system, such as Windows (CE) or Unix, the
same applications, the same communication protocols, and the same
hardware architecture. The only difference is that portable computers are
smaller, have a battery and a wireless interface, and often use low-power
components [2].

Even though battery technology is improving continuously and processors and
displays are rapidly improving in terms of power consumption, battery life and
battery weight are issues that will have a marked influence on how hand-held
computers can be used. Energy consumption is becoming the limiting factor in
the amount of functionality that can be placed in these devices. More extensive
and continuous use of network services will only aggravate this problem, since
communication consumes a relatively large amount of energy.

Another key challenge of mobile computing is that many attributes of the
environment vary dynamically. Mobile devices face many different types of
variability in their environment [3]. Therefore, they need to be able to operate
in environments whose available resources and services can change drastically
in the short term as well as the long term. Merely algorithmic adaptations
are not sufficient; rather, an entirely new set of protocols and/or algorithms
may be required. For example, mobile users may encounter a completely
different wireless communication infrastructure when walking from their office
to the street [4]. A possible solution is to have a mobile device with a
reconfigurable architecture so that it can adapt its operation to the current
environment and operating condition. Adaptability and programmability should be
major requirements in the design of the architecture of a mobile computer.

We are entering an era in which each microchip will have billions of
transistors. One way to use this opportunity would be to continue advancing
our chip architectures and technologies as just more of the same: building
microprocessors that are simply more complicated versions of the kind built
today. However, simply shrinking the data processing terminal and radio modem,
attaching them via a bus, and packaging them together does not alleviate the
architectural bottlenecks. The real design challenge is to engineer an integrated
mobile system where data processing and communication share equal importance
and are designed with each other in mind. Just integrating a current PC or
PDA architecture with a communication subsystem is not the solution. One of
the main drawbacks of merely packaging the two is that the energy-inefficient
general-purpose CPU, with its heavyweight operating system and shared bus,
becomes not only the center of control, but also the center of data flow in the
system and a main cause of energy consumption [1].

Clearly, there is a need to revise the system architecture of a portable computer 
if we want to have a machine that can be used conveniently in a wireless 
environment. A system level integration of the mobile’s architecture, operating 
system, and applications is required. The system should provide a solution with 
a proper energy-efficient balance between flexibility and efficiency through the 
use of a hybrid mix of general-purpose and the application-specific approaches. 
The key to energy efficiency in future mobile systems will be designing higher 
layers of the mobile system, their system architecture, their functionality, their 
operating system, and indeed the entire network, with energy efficiency in 
mind. Furthermore, because the applications have direct knowledge of how the
user is using the system, this knowledge must be propagated into the power
management of the system.
Realisation of an Adaptive Audio Tool



Arnaud Meylan and Catherine Boutremans 
Swiss Federal Institute of Technology at Lausanne (EPFL)
CH-1015 Lausanne, Switzerland 

Abstract. Real-time audio over the best-effort Internet often suffers
from packet loss. So far, Forward Error Correction (FEC) seems to be
an efficient way to attenuate the impact of loss. Nevertheless, to ensure
the efficiency of FEC, the source rate must be continuously controlled to
avoid congestion. In this paper, we describe a realisation of adaptive FEC
subject to a TCP-friendly rate control.



1 Introduction 

The Internet has changed from mainly a file transfer and e-mail tool to a network
for multimedia and commercial applications, among others. This change brought
up many new technical challenges, such as the transport of real-time data over
non-real-time lossy networks, which has been addressed by the Real-time Transport
Protocol (RTP) and FEC techniques.

Unfortunately, FEC is too often used without rate control, which leads to
more congestion, more loss, and hence worse audio quality. The purpose of this
work is to add adaptive FEC to an existing application, the Robust Audio Tool
[7]. FEC will be constrained by a TCP-friendly rate, as proposed by Mahdavi and
Floyd.

In this paper the general problem of optimizing quality at reception is
presented, and a more specific solution is then derived.

2 State of the Art 

2.1 FEC 

Definition. Forward Error Correction (FEC) relies on the addition of repair
data to a stream, from which the content of lost packets may be recovered
at the destination, at least in part. Two classes of repair data may be added to
a stream [3]: those which are independent of the contents of that stream (e.g.
parity coding, Reed-Solomon codes) and those which use knowledge of the stream
to improve the repair process. In the context of real-time audio, the most
popular scheme, which was standardized by the IETF, is the repetition of audio
units in the stream. In the following, we will refer to this scheme as
signal-processing based FEC (SFEC).



SFEC. In order to simplify the following discussion, we distinguish a (media)
unit of data from a packet. A unit is an interval of audio data, as stored
internally in an audio tool. A packet comprises one or more units, encapsulated
for transmission over the network.

The principle of FEC used here is to repeat units of audio in multiple packets.
If a packet is lost, then another packet containing the same unit will be able to
cover the loss and be played, provided it arrives. This approach has been
advocated for use on the Mbone, and extensively simulated by Podolsky, who
calls this framework “Signal-processing based FEC” (SFEC)^1. Redundant audio
units are piggy-backed onto a later packet, which is preferable to the
transmission of additional packets, as this decreases the amount of packet
overhead and the number of routing decisions.
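
The piggy-backing principle described above can be illustrated with a minimal
sketch. It is not RAT's implementation: the function names, the dictionary-based
packet layout and the unit labels are hypothetical, and the redundant copy is
carried unchanged here, whereas a real sender would typically re-encode it with
a lower-rate codec.

```python
def packetize(units, offset):
    """Packet i carries unit i as primary and unit i - offset as redundant copy."""
    packets = []
    for i, unit in enumerate(units):
        redundant = units[i - offset] if i >= offset else None
        packets.append({"seq": i, "primary": unit, "redundant": redundant})
    return packets


def reconstruct(received, total, offset):
    """Rebuild the playout sequence; None marks a unit that could not be repaired."""
    playout = [None] * total
    for pkt in received:                      # first use every primary copy
        playout[pkt["seq"]] = pkt["primary"]
    for pkt in received:                      # then fill holes from redundant copies
        j = pkt["seq"] - offset
        if j >= 0 and playout[j] is None:
            playout[j] = pkt["redundant"]
    return playout


units = ["u0", "u1", "u2", "u3", "u4"]
packets = packetize(units, offset=1)

# Single loss: packet 2 is dropped, unit u2 is recovered from packet 3.
single = reconstruct([p for p in packets if p["seq"] != 2], len(units), 1)

# Burst loss: packets 2 and 3 are dropped, u2 cannot be repaired with offset 1.
burst = reconstruct([p for p in packets if p["seq"] not in (2, 3)], len(units), 1)
```

The burst case shows why consecutive losses defeat a redundant copy that travels
in the very next packet.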

There is a potential problem with applying redundancy in response to loss,
because when loss occurs the (user) reflex is to add redundancy at the source.
This increases the overall transmission rate and the congestion, probably
producing even worse quality. Redundancy can indeed be added when loss occurs,
but at the same time the source encoding must be changed to use less bandwidth
in response to congestion. Our framework continuously ensures that the
overall transmission rate is smaller than or equal to the TCP-friendly rate
proposed by Mahdavi and Floyd. It is a combined source rate/redundancy control.

2.2 TCP-Friendly Rate Control 

As networked multimedia applications become widespread, it becomes increa- 
singly important to ensure that they can share resources fairly with each other 
and with current TCP-based applications, the dominant source of Internet traf- 
fic. The TCP protocol is designed to reduce its sending rate when congestion is 
detected. Networked multimedia applications should exhibit similar behavior, if 
they wish to co-exist with TCP-based applications. 

One way to ensure such co-existence is to implement some form of congestion
control that adapts the transmission rate in a way that fairly shares bandwidth
with TCP applications. One definition of fair is that of TCP friendliness:
the non-TCP connection should receive the same share of bandwidth (namely
achieve the same throughput) as a TCP connection.

Mahdavi and Floyd have derived an expression relating the average TCP
throughput $R_{TCP}$ to the packet loss rate:

$$R_{TCP} = 1.22 \, \frac{MTU}{RTT \, \sqrt{\pi^*}} \qquad (1)$$

where MTU is the packet size being used on the connection, RTT is the round
trip time, and $\pi^*$ is the loss rate being experienced by the connection.

In our application, we fix the MTU to 576 bytes, the minimum size for TCP. 
Current values for RTT are obtained using RTCP Sender and Receiver Reports. 
The packet loss rate $\pi^*$ is computed at the receiver and reported to the sender
via the Fraction lost field of the Receiver Reports. 
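
As a rough illustration of how equation (1) could be evaluated from RTCP
feedback, the following sketch computes the TCP-friendly rate in kb/s. The
function names are hypothetical and this is not code from RAT; the conversion
of the 8-bit Fraction lost field (value divided by 256) follows the RTP
specification, and 1 kb is taken as 1000 bits.

```python
import math


def loss_rate_from_fraction_lost(fraction_lost_field):
    """Convert the 8-bit RTCP 'fraction lost' field to a loss rate in [0, 1)."""
    return fraction_lost_field / 256.0


def tcp_friendly_rate_kbps(rtt_s, loss_rate, mtu_bytes=576):
    """Equation (1): R_TCP = 1.22 * MTU / (RTT * sqrt(loss rate)), in kb/s."""
    if loss_rate <= 0.0:
        return math.inf          # no loss observed: the formula gives no bound
    bytes_per_second = 1.22 * mtu_bytes / (rtt_s * math.sqrt(loss_rate))
    return bytes_per_second * 8.0 / 1000.0


# Example: RTT of 1.3 s and 10% loss, with the MTU fixed to 576 bytes.
rate = tcp_friendly_rate_kbps(rtt_s=1.3, loss_rate=0.10)
```

With these example values the bound is on the order of 14 kb/s, which shows how
quickly a long round trip time and a high loss rate shrink the allowed rate.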

1 “because this approach exploits a signal-processing based model of the audio signal 
to effectively compute its error-correcting information” 
3 The Robust Audio Tool

3.1 Software Presentation 

The Robust Audio Tool (RAT) [7] is an open-source audio conferencing and
streaming application that allows users to participate in audio conferences over
the Internet.

This software is based on IETF standards: it uses RTP above UDP/IP as its
transport protocol, according to the RTP profile for audio and video conferences
with minimal control. RAT features a range of codecs of different rates and
qualities, receiver-based loss concealment to mask packet losses, and
sender-based channel coding.

3.2 Some Useful Concepts about RAT 

The channel coder. Let us define the following terms: 

A codec is defined as an encoder-decoder performing one type of
compression/decompression on the audio stream. The encoder part takes a buffer of
audio stream as input, performs encoding, and outputs a playout buffer of media
units.

The channel coder builds channel units, which represent the payload of the
RTP packets. In RAT the channel coder supports four different kinds of channel
coding, three of which allow FEC: redundancy (SFEC), interleaving,
and layered coding. For our work, FEC will be added in the form of SFEC, using
the redundant channel coder.



The Message Bus. RAT comprises three separate processes: controller, media
engine, and user interface. Communication between them is provided by the
Message bus (Mbus), the sole means to ensure coordination of multimedia
conferencing systems.

The Message bus was proposed by Colin Perkins and Jorg Ott. It solves
the typical problem of separate tools providing audio, video, and shared workspace
functionality. It maps well onto the underlying RTP media streams, which are also
transmitted separately.

A message contains a header and a body. The former indicates notably the
source and destination addresses; the latter contains the message to be
delivered to the application. The message is transmitted over UDP in the form of
a string. A function maps it into a C function call at the destination
(unmarshalling). We considered the following Mbus commands:

Command name              Usage

tool.rat.codec(.)         Specifies primary codec being used by this source

audio.channel.coding(.)   Specifies secondary codec, and its relative offset
                          to the primary

tool.rat.rate(.)          Sets the number of audio units (codec frames,
                          typically) placed in each packet when transmitting,
                          assuming a unit is t ms long
These messages provide high-level access to the program. Most of the
implementation was done using these commands.



4 Adaptive FEC Algorithm 

4.1 The General Problem 

We want to maximize a certain measure of quality at reception. Quality can 
be estimated according to a non-subjective (loss rate after reconstruction) or 
subjective measure (perceived quality at destination depending on the quality 
of each audio unit played). To realize this, the SFEC-channel coder algorithm 
must choose a range of variables: 

— k denotes the number of copies sent for the same audio unit. k is bound
to the loss probability of the network and the desired apparent loss rate after
reconstruction at reception.

— $o = [o_1, o_2, \ldots, o_k]$ denotes the offset of redundant unit i relative
to the first transmitted unit. One always has $o_1 = 0$.

It has been shown that loss presents some degree of correlation: if packet
n is lost, then there is a higher than average probability that packet n + 1
will also be lost. This indicates that an advantage in perceived audio quality
can be achieved by offsetting the redundancy. The disadvantage of offsetting
is the increase of the playout delay. It is a compromise between delay and
robustness. Nevertheless, in the half-duplex case for example, delay is not
important and offset can be used extensively.

— The set of available codecs is associated with a finite set of encoding rates
$r_i$: $\mathcal{R} = \{r_i\}_{i=1}^{N}$. Notice that $r_0 = 0$ corresponds to no
coding. Codecs are ordered according to their bit rates, namely
$i < j \Leftrightarrow r_i < r_j$.

— Let $x = \{x_i\}_{i=1}^{k}$ denote the number of the codec used to encode the
i-th redundant unit.

— r is the total payload rate, $r = \sum_{i=1}^{k} r_{x_i}$.

— t denotes the frame duration of an audio unit; it is usually 20 ms, 40 ms or
80 ms. Increasing t increases the end-to-end delay but reduces the number
of packets sent, and thus the header overhead.

Input parameters are:

— The TCP-friendly rate constraint R, which is slightly smaller than $R_{TCP}$.
Indeed, $R_{TCP}$ provides an upper bound on the total throughput
$R_{TCP} = R + R_{headers}$ allowed for the application. R only represents the
maximum payload (audio) throughput. $R_{headers}$ is the IP/UDP/RTP header
throughput: this application sends (20 + 8 + 12) bytes of header every t ms. The
payload rate is then

$$R = R_{TCP} - \frac{8 \, (20 + 8 + 12)}{t} \quad [kb/s], \qquad (2)$$

with t expressed in ms.

— The parameters of the loss process on the network, typically p and q for the
Gilbert model.
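
The header overhead in the payload-rate constraint is simple arithmetic: 40
bytes of IP/UDP/RTP header every t ms amount to 320/t kb/s. The sketch below
works this out; the function names are hypothetical, and the 64 kb/s figure is
only an example value for the TCP-friendly rate.

```python
HEADER_BYTES = 20 + 8 + 12   # IP + UDP + RTP headers, as stated in the text


def header_rate_kbps(frame_ms, header_bytes=HEADER_BYTES):
    """Header throughput: one header per packet, one packet every frame_ms."""
    return header_bytes * 8.0 / frame_ms     # bits per millisecond equals kb/s


def payload_rate_kbps(r_tcp_kbps, frame_ms):
    """Payload (audio) budget R left once the headers are paid for."""
    return r_tcp_kbps - header_rate_kbps(frame_ms)


# With 20 ms frames the headers alone cost 16 kb/s; with 40 ms frames, 8 kb/s.
overhead_20 = header_rate_kbps(20)
overhead_40 = header_rate_kbps(40)
budget = payload_rate_kbps(64.0, 20)
```

Doubling the frame duration halves the header cost, which is the trade-off
against delay mentioned for the parameter t.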
In this context, it would be useful to maximize a function
$f(k, o, x, t, R, p, q)$ representing the quality at reception. Our solution will
be less general; the (restrictive) hypotheses are presented below.

4.2 Models 

Loss Model. The characteristics of the loss process of audio packets are
important to determine how to use SFEC. Previous work on the subject
proposes a two-state Markov chain, the Gilbert model, for the loss process (Fig.
1). States 1 and 0 correspond to a packet loss and to a packet reaching its
destination, respectively. Let q denote the probability of going from state 1 to
state 0, and p the probability of going from state 0 to state 1.

The current implementation of the software does not report q. An assumption
is then needed. Usually one assumes p + q = 1, and the model turns into a
Bernoulli loss process of parameter $\pi^*$, in which losses are independent.
Since this is not realistic, we prefer to assume that the average loss burst
length is 1.5 packets, therefore q = 2/3. This ambitious assumption is taken from
the literature and has the advantage of reflecting the observed correlation of
the loss process. $\pi^*$ represents the average loss rate, such as reported by
RTCP:

$$\pi^* = \frac{p}{p + q}. \qquad (3)$$

So p is:

$$p = \frac{q \, \pi^*}{1 - \pi^*} = \frac{2}{3} \, \frac{\pi^*}{1 - \pi^*}. \qquad (4)$$



Loss Rate after Reconstruction. We are then interested in loss rates after
reconstruction, as a function of k and o. At this stage, a drastic simplification
due to the current implementation of RAT appears: k cannot exceed 2, so only one
level of redundancy can be used. Therefore only $o_2$ remains to be computed.

Let us examine the loss rate after reconstruction for different offsets (we will
denote $o_2$ by n). The probability $\bar{p}$ of losing both packet l and packet
l + n is easy to compute using basic probability and Markov chain properties:




Fig. 1. The Gilbert Model (transition probabilities p and q)



$$\bar{p} = P(\text{pack } l \text{ lost} \cap \text{pack } l+n \text{ lost})$$
$$= P(\text{pack } l+n \text{ lost} \mid \text{pack } l \text{ lost}) \, P(\text{pack } l \text{ lost})$$
$$= P(\text{pack } n \text{ lost} \mid \text{pack } 0 \text{ lost}) \, P(\text{pack } 0 \text{ lost})$$
$$= p_{11}^{(n)} \, \pi^*,$$

where $p_{ii}^{(n)}$ is the probability to return to state i after n steps, i.e.
it is the ii-th entry of the n-step transition matrix $P^n$.

For this chain one can compute that

$$P^n = \frac{1}{p+q} \begin{pmatrix} q & p \\ q & p \end{pmatrix} + \frac{(1-p-q)^n}{p+q} \begin{pmatrix} p & -p \\ -q & q \end{pmatrix}, \qquad (5)$$

hence

$$p_{11}^{(n)} = \frac{1}{p+q} \left( p + q \, (1-p-q)^n \right). \qquad (6)$$

Therefore the probability to lose both packets is

$$\bar{p} = \frac{p}{(p+q)^2} \left( p + q \, (1-p-q)^n \right). \qquad (7)$$

The objective is to minimize $\bar{p}$, since it represents the probability of
failing to repair the audio stream. The only adjustable parameter is n (positive
integer), since p is bound to the network's state and we assumed a constant value
for q. Therefore we compute

$$\frac{\partial \bar{p}}{\partial n} = \frac{p \, q}{(p+q)^2} \, (1-p-q)^n \, \ln(1-p-q). \qquad (8)$$

Extrema are obtained for zeros of (8). Three cases are distinguished:

— $0 < 1 - p - q < 1$. The factor $\ln(1 - p - q)$ in (8) is negative and all
other factors are positive, so the two-packet loss probability $\bar{p}$
decreases with n, and for $n \to \infty$ the derivative tends to 0. Hence
$n \to \infty$ leads to a minimum (we do not prove it here). This means that
the more you offset the redundant audio unit, the smaller the loss rate
after reconstruction. But at the same time the playout delay increases, since the
application must “wait” for the redundant units.

— $1 - p - q = 0$. All values of n are optimal, since $\bar{p}$ is then constant.

— $-1 < 1 - p - q < 0$. The logarithm function is not defined. We observe from
(7) that the smallest value of $\bar{p}$ is obtained for n = 1.

Since q = 2/3, the conditions simplify to p < 1/3, p = 1/3, and p > 1/3.
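
The closed form for the n-step return probability and the case analysis above
can be checked numerically. The sketch below is illustrative only: the function
names are hypothetical, state 1 is taken as "packet lost", and the transition
matrix is written with p as the probability of going from state 0 to state 1
and q from state 1 to state 0.

```python
def n_step_matrix(p, q, n):
    """n-step transition matrix of the two-state chain, by repeated multiplication."""
    P = [[1.0 - p, p], [q, 1.0 - q]]
    M = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        M = [[sum(M[i][m] * P[m][j] for m in range(2)) for j in range(2)]
             for i in range(2)]
    return M


def p11_closed_form(p, q, n):
    """Probability of being in the loss state again n packets after a loss."""
    return (p + q * (1.0 - p - q) ** n) / (p + q)


def both_lost(p, q, n):
    """Probability that a packet and its redundant copy n packets later are both lost."""
    return p / (p + q) * p11_closed_form(p, q, n)


if __name__ == "__main__":
    q = 2.0 / 3.0
    for p in (0.2, 0.5):   # one value below 1/3 and one above
        print(p, [round(both_lost(p, q, n), 4) for n in range(1, 6)])
```

For p below 1/3 the printed values decrease with n, while for p above 1/3 the
smallest value is the one for n = 1, in line with the three cases.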

Figure 2 shows plots of $\bar{p}$, the loss rate after reconstruction, as a
function of n and p; the first one illustrates the case p < 1/3, the second one
all three cases.

Fig. 2. $\bar{p}$ as a function of n and p. (a) p ∈ [0, 0.33], (b) p ∈ [0, 1]

In Fig. 2(a) we can observe that when n increases, $\bar{p}$ tends quickly to its
limit. For n = 3, $\bar{p}$ already nearly equals the limit. More offsetting
probably brings only a small further reduction of $\bar{p}$, while it increases
the playout delay. But we do not know precisely which n is optimal, since the
penalty induced by delay on audio quality is not clearly defined. This explains
the need for the above-mentioned function appraising quality at reception.
Without it, a rather arbitrary choice is to set n = 3. Nevertheless, it nicely
improves the loss rate after reconstruction compared to n = 1, and the playout
delay introduced seems reasonable^2.

The two last cases are clear. When p > 1/3 one clearly chooses n = 1, since it
guarantees the smallest delay and loss rate after reconstruction. For p = 1/3 any
value of n is optimal, so we choose n = 1 since it minimises delay. We
implemented the following algorithm to select the offset of the redundant unit:

    p = (2.0 / 3.0) * pi_star / (1.0 - pi_star);  // pi_star: loss rate reported by RTCP
    if (p < 1.0 / 3.0) n = 3;
    else n = 1;

These results are specific to this value of q; in the general case, the test in
the if statement should be on p + q, i.e. p + q < 1.

2 The extra playout delay due to offset is then three frame durations.
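
A minimal sketch of the offset selection, written for a general q rather than
only q = 2/3, could look as follows. The function names and the max_offset
parameter are hypothetical; the value 3 is the arbitrary choice discussed in
the text, and p is derived from the reported loss rate as p = q * loss / (1 -
loss).

```python
def gilbert_p(loss_rate, q=2.0 / 3.0):
    """Transition probability p (no loss -> loss) implied by the average loss rate."""
    return q * loss_rate / (1.0 - loss_rate)


def choose_offset(loss_rate, q=2.0 / 3.0, max_offset=3):
    """Offset of the redundant unit: spread out only when losses are bursty."""
    p = gilbert_p(loss_rate, q)
    if p + q < 1.0:          # positive correlation: a larger offset helps
        return max_offset
    return 1                 # otherwise the smallest delay is also the best choice


low_loss_offset = choose_offset(0.10)    # p is about 0.074, so p + q < 1
high_loss_offset = choose_offset(0.50)   # p is about 0.667, so p + q > 1
```

With q = 2/3 the test p + q < 1 is the same as p < 1/3, which corresponds to a
reported loss rate below one third.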



Redundancy Level. At this point, we have to decide whether redundancy must be
used or not, depending on the loss rate perceived at reception and the target
loss rate at destination $\tau$. In the absence of redundancy the loss rate
equals $\pi^*$; if redundancy is used, the loss rate after reconstruction is
$\bar{p}$.

It is not very clear how much audio loss can be tolerated at the destination; it
depends on voice reconstruction techniques, the audio unit's length, and the
subjective tolerance of the user. It appears that with packet repetition
techniques a 5% loss rate after reconstruction can be tolerated. We then assume
$\tau$ = 0.05, and the choice about redundancy becomes:

    if (pi_star > tau) k = 2;
    else k = 1;

It is interesting to remark that with this model and one single piece of offset
redundant audio, loss rates on the network of up to 22% lead to less than 5% loss
rate after reconstruction; 30% network loss gives 9%. With these theoretical
results, more than one level of redundancy seems useless.
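
The two figures quoted above can be reproduced from the Gilbert model with
q = 2/3 and an offset of 3. The sketch below is only a numerical check under
those assumptions; the function names are hypothetical.

```python
def residual_loss(loss_rate, n=3, q=2.0 / 3.0):
    """Loss rate after reconstruction with one redundant copy at offset n."""
    p = q * loss_rate / (1.0 - loss_rate)            # from loss_rate = p / (p + q)
    return loss_rate * (p + q * (1.0 - p - q) ** n) / (p + q)


def redundancy_level(loss_rate, target=0.05):
    """k = 2 (one redundant copy) when the raw loss rate exceeds the target."""
    return 2 if loss_rate > target else 1


at_22_percent = residual_loss(0.22)   # just under 0.05
at_30_percent = residual_loss(0.30)   # close to 0.09
```

The first value stays just below the 5% target and the second is about 9%,
matching the statement in the text.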
Encodings. The last parameters to determine are the codecs used for encoding.
We use an algorithm inspired by Free-Phone, which solves the problem
assuming a variable number of redundancy levels can be used. Since we have
one piece of redundancy at most, we derive a simpler algorithm that uses their
general result: “The main information should be encoded using the highest
quality encoding scheme (among those used to encode the main and the redundant
information)”.

A set of codecs must be chosen before running the algorithm. We picked a set
covering the range from 5.6 kb/s to 64 kb/s with 8 kHz sampling
frequency: LPC (5.6 kb/s), GSM (13.2 kb/s), G726-16 (16 kb/s), G726-24 (24
kb/s), G726-32 (32 kb/s), G726-40 (40 kb/s) and A-law (64 kb/s). It represents
a fairly homogeneous distribution of usable rates, and so ensures that the
available bandwidth will be used well.

We ordered the codecs according to their bit rates; this differs slightly from
their order of quality. We assume that choosing a codec with a higher bit rate
provides better audio quality.

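k denotes the redundancy level (k ∈ {1, 2} here), $x_i$ is the codec number
for the i-th copy, R is the available throughput, and $r_i$ is the rate used by
the i-th codec, assuming N codecs are available. The following algorithm is valid
for any k ≥ 1; it gives the codec number to use for each of the k copies, with
the primary encoding stored in x[0].

    x[0] = 1;                              // ensures minimum encoding
    for (i = 1; i <= k - 1; i++) x[i] = 0; // no redundant copy yet
    r = rate[x[0]];
    i = 0;
    do {
        if (x[i] < N) x[i]++;              // try the next better codec for copy i
        r = 0;
        for (j = 0; j < k; j++) {
            r += rate[x[j]];               // total payload rate of all copies
        }
        if (r > R) {
            // never drop below the minimum encoding
            if (x[0] == 1 && (k == 1 || x[1] == 0)) break;
            x[i]--;                        // undo the step that exceeded R
            break;
        }
        if (x[k-1] == N) break;            // all codecs have the best quality
        i = (i + 1) % k;                   // move on to the next copy
    } while (r < R);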
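
For experimentation, the codec selection loop can be transcribed into a
runnable sketch. This is an illustrative reading of the pseudocode, not the
RAT source: the function name is hypothetical, the rate table uses the codec
set listed earlier with index 0 meaning "no coding", and the guard for k = 1
is an added assumption so that x[1] is never read when it does not exist.

```python
# Index 0 means "no coding"; 1..7 are LPC, GSM, G726-16/24/32/40 and A-law (kb/s).
RATES_KBPS = [0.0, 5.6, 13.2, 16.0, 24.0, 32.0, 40.0, 64.0]


def select_codecs(R, k, rates=RATES_KBPS):
    """Return codec indices x[0..k-1] (x[0] is the primary) for payload budget R."""
    N = len(rates) - 1
    x = [1] + [0] * (k - 1)          # minimum primary encoding, no redundancy yet
    i = 0
    while True:
        if x[i] < N:
            x[i] += 1                # try the next better codec for copy i
        r = sum(rates[c] for c in x)
        if r > R:
            at_minimum = x[0] == 1 and (k == 1 or x[1] == 0)
            if not at_minimum:
                x[i] -= 1            # undo the step that exceeded the budget
            break
        if x[k - 1] == N:
            break                    # every copy already uses the best codec
        i = (i + 1) % k              # round-robin over the copies
        if not r < R:
            break
    return x


with_redundancy = select_codecs(30.0, k=2)     # G726-16 primary, GSM secondary
without_redundancy = select_codecs(30.0, k=1)  # G726-24 alone
```

Because the primary copy is always upgraded first, it never ends up with a
lower-rate codec than the redundant copy, which matches the Free-Phone result
quoted earlier.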



Frame duration. We chose t = 20 ms as the frame duration. It enables lower
latency and reduces the time spent waiting for offset packets.
5 Evaluation

5.1 Results 

We have implemented the adaptive channel coding described above, subject to
rate control. Tests were performed between EPFL (Switzerland) and two
destinations in Europe: University College London (UCL) and the Faculte
Polytechnique de Mons in Belgium.

During one week, connections were established with UCL at different times of
the day. The loss rates never significantly exceeded 1%, so SFEC was not needed
and the best possible primary quality was chosen by the algorithm. The absence
of loss is certainly explained by the very high speed links available between
both universities.





Fig. 3. Source rates and network loss 

Connections with the Faculte de Mons were more lossy; the network was highly
congested, since the RTT was about 1.3 s and the average loss
rate above 10%. Figure 3 shows the TCP-friendly rate constraint R, the payload
rate $r_{x_1} + r_{x_2}$, and the rate used by the primary encoding $r_{x_1}$
during a 400 s connection. This makes it possible to see when redundancy is
applied in response to loss on the network, which is shown at the bottom.

The rate constraint is respected except when it is smaller than 5.6 kb/s, because
our algorithm keeps sending audio at this rate. In this situation the conversation
should perhaps be stopped. When redundancy is needed, bandwidth is shared
between primary and secondary encoding. In some cases redundancy should be
used but a low rate constraint prevents it.

We could hear a fine quality improvement at the destination when using SFEC
to attenuate the effect of loss. We do not provide measurements of the loss rate
after reconstruction here, since the efficiency of the method has already been
demonstrated in earlier work.
5.2 Future Work

The most important next step seems to be building a function to estimate the
quality perceived at reception. Without it, choices must be made statically. With
it we could dynamically change the offset or the frame duration, and probably
provide better quality.

Many hypotheses were made to build the model; the roughest was certainly
to assume q = 2/3. It would be useful to have an estimator for this value,
computed at the destination and reported by RTCP.

In rare cases (loss rates > 22%), one more level of redundancy could be
useful. RAT's implementation does not support this at present, but it
seems possible to add this functionality. Notice that the presented model
must then be extended to compute the probability of losing all three packets. In
our opinion more than three levels of redundancy are not useful.

We assumed that a 64 kb/s (A-law) codec provides better quality than a 32
kb/s (G726-32) one; in our experience the difference is really faint. Instead it
could be better to double the sampling frequency and keep the G726 codec, to get
an 8 kHz audio bandwidth at a 64 kb/s rate. It would have the convincing
advantage of providing a larger bandwidth than the Plain Old Telephone System.



6 Conclusion 

In this paper we proposed a method to provide adaptive source coding for the
Robust Audio Tool. It satisfies fair-rate constraints while efficiently improving
the quality perceived at reception, thanks to the adaptation of redundancy level,
encoding rates, and offset. Nevertheless, the proposed algorithms are certainly
not optimal, because we had to make some assumptions.
Abstract. The usage of multimedia applications on the Internet has
seen phenomenal growth in recent years. Transport protocols that pro- 
vide partially reliable service have been suggested as one approach to 
better handle the requirements of these applications. A partially relia- 
ble service provides applications with the possibility of a flexible tradeoff 
between reliability and delay/throughput. Appropriately designed coders 
are, however, required to fully utilize a partially reliable service. In this 
paper we present a JPEG image coder tailored to suit the behavior of 
a partially reliable byte stream service. With regular JPEG, data loss 
typically results in severely distorted images. The robust recoder em- 
ploys three major modifications to standard JPEG in order to adapt to 
the partially reliable transport: (1) extended resynchronization markers 
in order to be able to resynchronize effectively, (2) block interleaving in 
order to spread out the loss of a packet across the image and (3) er- 
ror concealment in order to minimize the perceived quality loss. The 
modifications incorporate both new inventions, such as random window 
interleaving, as well as variations of previously known techniques. 



1 Introduction 

The Internet has proven to be a formidable success, and it has continued to grow
with new applications and new communication link technologies never originally
conceived. The research community continues to enhance and evolve the Internet
and its services. One area of research centers on the concept of partially reliable
and partially ordered transport services. The most common Internet transport
protocol, TCP, provides a fully reliable, ordered transport service. Since the
underlying IP layer only provides an unreliable, unordered service, the transport
layer must in this case provide reliability and reordering mechanisms. The
mechanisms used to obtain this incur a user cost in the form of added delay needed
for retransmissions and reorder buffering. Even if the application does not need
a fully reliable, fully ordered transport service, it experiences the penalty of
added delay. Neither TCP nor UDP, the other common Internet transport
protocol, is suitable for applications that can accept some controlled degree of
loss or out-of-order delivery. By using transport protocols such as POC [2] or
PECC that provide a transport service with flexible reliability and/or ordering
constraints, it is possible to obtain increased application performance by
relaxing the reliability and ordering constraints. Increased performance for a
video application by relaxed ordering and reliability constraints has been shown
in earlier work. Increased performance can also be obtained for non-streaming
applications such as image transfer. One example is the progressive display of
GIF images. The JPEG image coding standard is very common on the Internet due
to the vast amount of web pages with embedded JPEG images. No reports of
the possible benefits of using a partially reliable service for transferring JPEG
web images have been found in the literature. This paper describes the design and
implementation of a JPEG image coder adapted to work well with a partially 
reliable transport service. This coder is intended to be used as a component in 
a test environment where system-wide advantages and disadvantages of using 
partially reliable transport for web transfers can be evaluated. The robust coder 
contains both known techniques as well as new ones, such as random window 
interleaving. The remainder of this document is structured as follows. In Sect. 2 
background information and related work is presented. Section 3 gives a brief 
overview of JPEG, focusing on the details needed to understand the modifica- 
tions necessary to accommodate for partial reliability. These modifications are 
described in Sect. 4. Finally, conclusions are given in Sect. 5. 

2 Background and Related Work 

We have developed PRTP, a partially reliable transport protocol, which allows 
the required reliability level to be specified by the application. This allows the 
application to influence the tradeoff between latency and reliability to optimally 
suit the current operating conditions and user preferences. The PRTP transport 
protocol allows the reliability to be specified between 0 and 100%. Further
details of PRTP and recent performance measurements are presented in [?]. For
ease of implementation and simple WWW integration, the first implementation
of PRTP is based on TCP. This allows the use of TCP's mechanisms for flow
control and retransmissions. It also leads to some disadvantages inherent in TCP's
design, such as the inability to perform Application Layer Framing (ALF) [?].
An extensive simulation study [?] shows that PRTP has a more aggressive
congestion control behavior than TCP. As discussed in [?], a more aggressive
behavior can however be appropriate in certain situations.

When transporting image data, the loss of data may lead to information loss. 
One characteristic of the human visual system is its ability to extrapolate lost 
information to a certain degree. In order to ease this extrapolation, the coding of 
visual data must take the possibility of data loss into account. However, when an 
excessive amount of data is lost, a severe degradation in image quality will occur. 
This motivates the use of a partially reliable protocol that puts an upper bound 
on the amount of data loss possible, and correspondingly ensures a minimum 
image quality. 

Other work examining image transfer over unreliable or partially reliable
channels includes [?]. However, none of these proposals present a
solution that integrates easily into a web application, codes color images of ar- 
bitrary size and is able to handle the packet loss characteristics provided by the 
PRTP partially reliable transport service.

3 JPEG Basics

JPEG is a relatively complex standard that can be operated in a number of mo- 
des. Our coder is primarily aimed at transfer of web-type images, and these are 
typically coded in the lossy sequential baseline mode and stored in the JFIF [?]
interchange format. In order to give a background on how the suggested adapta- 
tion differs from the normal approach, a brief, somewhat simplified, introduction 
to baseline JPEG encoding is provided. The encoding of an image is typically 
comprised of the following steps: 

1. Change the image representation to the YCbCr color space and down-sample
the chrominance components (Cb, Cr).

2. Split the components into 8x8 sample point blocks and perform a DCT (Di- 
screte Cosine Transform) on each block. The resulting 64 coefficients are di- 
vided into one DC coefficient holding the average value of the sample points 
in the block, and 63 AC coefficients holding the amount of different spatial 
frequencies. 

3. Quantize the coefficients according to the quantization tables. Since higher 
spatial frequencies are less perceptually important, these are quantized more 
aggressively. 

4. Code the DC coefficient using predictive coding with the DC value of the 
previous block used as predictor. Run-length encode the AC coefficients in 
zigzag order. 

5. Huffman encode the run-length encoded data and insert JFIF headers. 
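
Steps 4 and 5 can be made concrete with a short sketch. The Python below is not taken from the coder described in this paper; the function names are hypothetical, and the run-length stage is simplified in that it ignores the 16-zero run limit and the size categories that baseline JPEG applies before Huffman coding. It shows the zigzag scan order, the predictive coding of the DC values, and a (zero run, value) representation of the AC coefficients.

```python
def zigzag_indices(n=8):
    """Return the (row, col) positions of an n x n block in zigzag scan order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        # Even diagonals are walked with decreasing row, odd ones with increasing row.
        order.extend((r, s - r) for r in (reversed(rows) if s % 2 == 0 else rows))
    return order


def dc_differences(dc_values):
    """Predictive coding: each DC value is coded as the difference from the previous one."""
    prev = 0  # the predictor starts at zero (and is reset at restart markers)
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


def run_length_ac(block):
    """Simplified (zero_run, value) pairs for the AC coefficients of one block.

    A trailing run of zeros is written as the end-of-block pair (0, 0).
    """
    ac = [block[r][c] for r, c in zigzag_indices(len(block))][1:]  # skip DC
    pairs, run = [], 0
    for value in ac:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    if run:
        pairs.append((0, 0))
    return pairs
```

Because each DC value depends on the previous block, a loss breaks the prediction chain until the predictor is reset, which is the state dependence discussed in Sect. 4.1.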

JFIF places the headers in the beginning of the file together with the tables ne- 
cessary to perform decoding. In baseline JPEG two quantization tables and four 
Huffman tables are required. The size of these headers and tables ranges from 200
to 600 bytes. The header data must be reliably transmitted, which is easily done 
with PRTP. After the header data follows the Huffman coded data represen- 
ting the image. The exact data organization is dependent on the down-sampling 
used. A typical configuration is 2hx2v down-sampling which causes the data to 
be organized as four Y blocks, then one Cb block and one Cr block. In this case
there will be six blocks grouped together, and this is the default configuration
in our coder. This group of associated blocks is called an MCU (Minimum Coded
Unit) and is used as an atomic unit for resynchronization. 

4 JPEG Adaptation 

The intended application area is web browsing, and the data loss type considered 
is packet loss, not bit errors or bit erasures. The coder must be able to handle the 
loss of one or more packets of variable size, and still resynchronize and conceal 
the loss as well as possible. The output quality from the coder should degrade 
gracefully as the loss rate increases. The typical loss rates expected will be lower 
than 10 %, with the transport service set to guarantee a maximum loss rate 
of 20%. In order to adapt the JPEG coding for partial reliability a strategy 
consisting of three steps is employed:

1. Extend the resynchronization capabilities of regular JPEG so that consecutive
losses up to several kilobytes can be handled without losing too much
data between the end of a loss and the first restart marker. 

2. Perform interleaving so that the lost data is not aggregated in one place in 
the image. 

3. Try to conceal the lost data by using redundancies not removed by the source 
coding. 

The steps are further explained below, and their effects are illustrated in
Figs. 1-8. The images in the figures were produced by our coder and used an
original web image of typical quality as input. The original image, shown in
Fig. 1, is a 276 x 185 pixel color image coded at 2.3 bpp. All test images except
the original image lose the same amount of data, 1.5 kbyte, at the same positions
in the transferred data stream. This corresponds to around 10% packet loss,
distributed as three lost packets of 512 bytes each. As can be seen in Fig. 2, a
regular JPEG image becomes considerably distorted by a 10% data loss.




Fig. 1. Original image 



Fig. 2. Original image after loss 



4.1 Decoder Resynchronization 

Decoder resynchronization is needed to allow the decoder to come to a known 
state after a data loss of unknown length. For a web application, the losses are 
of unknown length since the application has no control over how the transport
protocol distributes the data into segments. This also precludes the possibility
of using ALF principles which would considerably ease the task of decoder re- 
synchronization. We propose the use of extended resynchronization markers due 
to the specific problems that occur when using a partially reliable protocol that 
provides a stream abstraction, such as PRTP, in conjunction with JPEG coding. 
In this case the resynchronization problem can be considered to consist of three 
subproblems: 

— Coder internal state dependence 

Depending on the specific coding technique used, data can be more or less 
state dependent during decoding. GIF [?], for example, is highly state dependent
since it builds up a codebook using previous data. One approach to
lower the state dependence for GIF images is presented in [?]. The reduction
in state dependence has to be paid for with lower compression performance. 
JPEG has a lower degree of state dependence, in this case originating from 
the predictive coding used for the DC coefficient. 

— Data stream semantics 

A problem occurs if the data stream contains bytes that have different se- 
mantic meaning or is comprised of variable length codes such as Huffman 
codes. When a loss occurs, the decoder must resynchronize itself with the
data stream so that it can apply the correct semantics to the data received 
after the loss. 

— Media positioning 

After a loss of unknown length, the data that comes after the loss must 
be processed. However, as the amount of lost data is unknown, it is not 
possible to know how much space in the image the lost data corresponds to. 
This makes it impossible to map the data received after a loss to a correct 
position in the image. 

The JPEG standard provides optional resynchronization capability. This capa- 
bility is based on the insertion of restart markers into the data stream. Eight 
unique restart markers are specified by the JPEG standard. The periodical in- 
sertion of restart markers at MCU boundaries solves the coder internal state 
dependence since the predictive coder used for the DC coefficient is reset at 
restart markers. The data stream semantics is resolved since restart markers 
always occur at MCU borders. The media positioning problem is resolved to a 
degree since the number of lost MCUs can be inferred from the marker num- 
ber and R, the number of MCUs between resynchronization markers. This is 
however only true for smaller losses that do not extend over more than eight 
restart markers. For data losses encompassing more than 8R MCUs the decoder
cannot unambiguously determine the actual amount of data that has been lost, 
and hence cannot position the incoming data correctly in the image. The value 
of R can be increased, hence allowing for larger data losses without loss of
decoder synchronization. Increasing R, however, also increases M, the mean amount
of correct MCUs received after a data loss that cannot be decoded due to lack
of decoder synchronization (M = R/2). This creates the optimization problem
of choosing a large enough R to ensure resynchronization while minimizing M.
Rather than performing this optimization, we use extended restart markers that 
instead of eight unique markers provide 190 markers. This allows R to be set to 1 
and still guarantees unambiguous media positioning for all practical image sizes. 
Since the extension of the restart markers uses unused JPEG marker space, the 
extended restart markers provide enhanced resynchronization without having to 
use longer markers. The effect of using extended resynchronization markers
instead of regular JPEG markers is illustrated in Figs. 3 and 4, where Fig. 3 shows
regular JPEG resynchronization markers inserted at each MCU row and Fig. 4
shows the effect of using extended restart markers inserted at each MCU.
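
The media positioning ambiguity can be illustrated with a small sketch. It assumes, as in regular JPEG, that marker numbers cycle, so the marker ending restart interval i carries the number i modulo the number of distinct markers. The function name and its parameters are hypothetical and only serve to show why a small marker set cannot position data after a long loss.

```python
def resume_interval(last_good, marker, num_markers):
    """Return the index of the first restart interval decodable after a loss.

    last_good   -- absolute index of the last interval decoded before the loss
    marker      -- number of the first restart marker found after the loss
    num_markers -- how many distinct marker numbers exist (8 in regular JPEG)

    Assumes the marker that ends interval i carries the number i % num_markers.
    The decoder can only pick the smallest loss consistent with what it sees.
    """
    # Intervals last_good+1 .. j are treated as lost, where j is the smallest
    # index greater than last_good with j % num_markers == marker.
    extra = (marker - (last_good + 1)) % num_markers
    j = last_good + 1 + extra
    return j + 1


# Intervals 5..14 are lost, so the first marker seen ends interval 14.
true_resume = 15
with_8_markers = resume_interval(4, 14 % 8, 8)        # wraps around: wrong position
with_190_markers = resume_interval(4, 14 % 190, 190)  # unambiguous for this loss
```

With R = 1 each interval is one MCU, so in this example the eight-marker decoder would place the data received after the loss eight MCUs too early, whereas the larger marker set recovers the correct position.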




Fig. 3. JPEG regular restart 



Fig. 4. Extended restart 



When a data stream is adapted for resynchronization it becomes larger than 
the original. This data expansion occurs since resynchronization markers are 
inserted and the compression efficiency is lowered by the periodical resets of 
the predictive coding used for the DC coefficients. The data expansion can be 
expressed as an additional increase a of the resynchronized data over the original. 
In [?], the value of a is approximated by an expression in p and k, where p is the
compression ratio and k is a constant that has been experimentally determined
to be 0.04 for black and white pictures. The value of k for color images is dependent
on the amount of down-sampling of the chrominance components. Preliminary 
measurements indicate a data expansion lower than 10% for color images.

4.2 Interleaving

Interleaving is performed with the objective of distributing data losses as evenly 
as possible in the image in order to facilitate error concealment by some kind 
of interpolation. In order to achieve this distribution two problems must be 
addressed: 

1. How to ensure that a data loss of a typical size is as evenly distributed as 
possible over the image. 

2. How to ensure that the interleaved blocks lost in a second data loss are
distributed as far away from the previously lost blocks as possible.

In addition to distributing the blocks, the coefficients can also be distributed 
[?]. The use of coefficient distribution is decoupled from the interleaving method
used for block interleaving. Three block interleaving strategies will be presented, 
and these can all be used either with or without coefficient interleaving. 

In the following the length of a typical data loss is assumed to be L bytes. 
Such a loss should be as evenly distributed as possible over an image. To do this, 
the size of the image has to be known to ensure that the blocks will not be placed 
above each other by the interleaver. The pixel width and height of the image are 
denoted by W and H. The compression ratio expressed as p (bits/pixel) is also 
interesting since it determines the average number of lost blocks that maps to a 
data loss of size L. 




Fig. 5. Max. spreading interleaved 



Fig. 6. Max. spreading concealed

Maximum spreading interleaving. A simple interleaver would just use the
values discussed above to space the interleaved blocks D blocks apart according 
to a formula such as 



D = [Eq. (1) is not recoverable from the extracted text; the surviving terms are 8, F_x, W/(8 S_x^W) and H/(8 S_x^H)]    (1)



The value F_x is used to specify the relative amount of sample points present in
component x, and S_x^{W,H} is the amount of down-sampling used. For the typical
2hx2v down-sampling the values for the Y component are F_Y = 2/3 and S_Y^{W,H} = 1.
For the color components the values are F_{Cb,Cr} = 1/6 and S_{Cb,Cr}^{W,H} = 2. If D is
selected so that W mod D ≠ 0 and D mod W ≠ 0, a satisfactory solution is
achieved. This approach spreads a data loss evenly over the whole image, and
for losses of length L a maximum spreading is achieved.¹ However, if a second
data loss occurs so that the deinterleaved blocks neighbor the previously lost
blocks, then many lost blocks will have lost a neighboring block as illustrated
in Fig. 5. This occurs since both losses are interleaved with a fixed interblock
distance D. The repeated absence of a neighboring block severely hampers the 
performance of error concealment algorithms. This is the main disadvantage of 
this method. 



Random block interleaving. In order to minimize the probability of mul- 
tiple data losses incurring a stride of lost neighboring blocks it is possible to 
use random block interleaving. This interleaving uses a seeded random number 
generator to obtain the interleaved position of each block. This method requi- 
res that a common random number generator is implemented in the adapted 
JPEG coder and decoder, or that a generator is available from the system. The 
reported experiments used the random number generator present in Linux. The 
algorithm for random block interleaving is as follows: 

1. Place the sequence number of each image block in an array A. The array will 
have B elements, where B is the number of blocks in the component under 
interleaving. 

2. Feed the initial seed to the random number generator G() and set the
counters N and M to 0. The function G(y) returns a value between 0 and y - 1.

3. Assign the array index I as: I = N + G(B - N)

4. Swap the values at A[I] and A[N]. Increment N by one.

5. Update the output array O as follows: O[M] = A[N - 1], the value just
swapped into place. Increment M by one.

6. Go to step 3 until M equals B.

The above algorithm will produce an interleaving that is (pseudo-)randomized
and hence will not produce strides of lost blocks which also have lost
neighbors as is shown in Fig. 7. However, for the case with only one data loss,
this algorithm may produce a few lost blocks which also have a lost neighbor.
This is a disadvantage compared to the maximum spreading method which will 
not interleave so that neighbor blocks are lost for one data loss. 
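
The random block interleaving steps can be sketched as follows. This is an illustration rather than the code used for the reported experiments: Python's `random.Random` stands in for the shared generator G(), the function name is hypothetical, and the output value is read from the slot that the swap has just filled, which keeps the result a permutation of the block sequence numbers.

```python
import random


def random_block_order(num_blocks, seed):
    """Return a seeded pseudo-random permutation of block sequence numbers.

    out[m] is read here as the sequence number of the block sent in slot m.
    Coder and decoder obtain the same order if they share seed and generator.
    """
    rng = random.Random(seed)        # assumption: stands in for G()
    a = list(range(num_blocks))      # step 1: array A with B elements
    out = []
    for n in range(num_blocks):
        i = n + rng.randrange(num_blocks - n)  # step 3: I = N + G(B - N)
        a[i], a[n] = a[n], a[i]                # step 4: swap A[I] and A[N]
        out.append(a[n])                       # step 5: emit the selected block
    return out


# 216 blocks is only an example value; the seed is arbitrary.
order = random_block_order(216, seed=42)
```

The decoder can invert the interleaving by regenerating the same order from the seed and writing the block received in slot m to position order[m].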

¹ If an image contains areas which differ greatly in the amount of encoded data
produced, this will cause imbalance in the spreading.

Fig. 7. Random block interleaved



Fig. 8. Random block concealed 



Windowed random interleaving. In order to improve the random block in- 
terleaving we have devised a window-based random interleaving which is capable 
of alleviating the neighbor block loss problem for single data losses. The algo- 
rithm described above is modified as follows. First a window-size P is calculated. 
The value P signifies the number of MCUs that map to a loss of size L. If a given 
block after interleaving has any of its neighboring blocks within P MCUs, then 
a data loss could cause a neighbor block loss. In order to avoid this, a test is 
performed in step 4 to see if the block to be interleaved has its above or left 
neighbor located within P MCUs from the MCU to which O[M] belongs. This
could cause a loss of size L to include a neighboring block, which is
undesirable. If the block has a neighbor within the window, a new counter
N' = N + 1 is initialized and steps 3 and 4 above are repeated with N' and a new test is
performed. This is repeated until a sequence number producing a block without
any neighbor within the P window is obtained or N' equals B, in which case a
placement of the block leading to a neighbor within the window is inevitable, 
and N is used. The value P is dependent on the down-sampling used and can 
be calculated as 



P = \frac{L}{\max\{S_Y^W, S_{Cb}^W, S_{Cr}^W\} \cdot \max\{S_Y^H, S_{Cb}^H, S_{Cr}^H\} \cdot 8p}    (2)
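
As a worked example, the window size can be computed for the test image used in this paper. The sketch reads Eq. (2) as the loss size L in bytes divided by the average size of one MCU in bytes, where an MCU covers (8 max S^W) x (8 max S^H) pixels coded at p bits per pixel. That reading and the function name are assumptions made for illustration.

```python
def window_size_mcus(loss_bytes, bits_per_pixel, s_w, s_h):
    """Average number of MCUs covered by a loss of loss_bytes bytes.

    s_w, s_h -- horizontal and vertical down-sampling factors per component,
                given in the order (Y, Cb, Cr)
    """
    mcu_pixels = (8 * max(s_w)) * (8 * max(s_h))   # pixels covered by one MCU
    mcu_bytes = mcu_pixels * bits_per_pixel / 8    # average coded size of one MCU
    return loss_bytes / mcu_bytes


# 512-byte packet, 2.3 bpp, 2hx2v down-sampling of the chrominance components
p_window = window_size_mcus(512, 2.3, s_w=(1, 2, 2), s_h=(1, 2, 2))
```

For these values one MCU averages 73.6 bytes, so a single lost 512-byte packet corresponds to roughly seven MCUs.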

4.3 Error Concealment

When a data loss has occurred there will be a number of blocks with missing 
information. If coefficient interleaving has been used, then there will be many 
blocks which have one or a few coefficients missing. If coefficient interleaving 
was not used there will be a few blocks where all coefficient values are lost. 
The error concealment scheme employs the same technique regardless of whether
coefficient interleaving was used or not. Figures 5 to 8 were produced with
coefficient interleaving, and Figs. 6 and 8 show the effect of the described error
concealment. 

DC coefficient. The DC coefficient contains the average color of the block. If 
this value has been lost it is important to try to interpolate it in a way that 
produces the least visible difference. If the DC coefficient differs too much from 
the correct value, the edges around the block may become noticeable. These 
edges are especially perceptually important since they are straight line artifacts,
which are not well masked by the visual system. 

We base our reconstruction on a method mentioned in [?], which we refer
to as the least delta method. This method compares the difference between the 
two vertical neighbor blocks with the difference from the two horizontal neighbor 
blocks. The two blocks which have the lowest difference are averaged to obtain 
the interpolated value. 
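
The least delta method can be sketched in a few lines. The function name is hypothetical, and the sketch assumes that all four neighbors of the lost block are available; how missing neighbors are handled is not covered here.

```python
def conceal_dc(up, down, left, right):
    """Interpolate a lost DC value from the DC values of its four neighbors.

    The pair of opposite neighbors (vertical or horizontal) with the smaller
    difference is averaged, on the assumption that the image varies least
    along that direction.
    """
    if abs(up - down) <= abs(left - right):
        return (up + down) / 2
    return (left + right) / 2
```

For example, `conceal_dc(100, 104, 60, 140)` averages the vertical pair because it differs by 4 while the horizontal pair differs by 80.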

AC coefficients. The AC coefficients contain information on the presence of 
different spatial frequency contents in the block. This fact is used when
performing the interpolation as described in [?]. The coefficients can be divided into three
categories: (1) containing mainly horizontal frequency information, (2) contai- 
ning mainly vertical frequency information or (3) containing both. Accordingly, 
coefficients containing information about vertical patterns are interpolated from 
the blocks directly above and below. Correspondingly, coefficients with horizontal
patterns are interpolated from blocks to the left and right of the missing block.
This can be further enhanced by checking that the maximum difference between
the two blocks used for interpolation is not too large, in which case the
interpolation instead uses four or eight neighboring blocks. For coefficients that contain
both horizontal and vertical frequencies simple four-way interpolation is used. 
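
A minimal sketch of this directional interpolation is given below. How the three categories map to coefficient positions is an assumption made for illustration: coefficients in row 0 are treated as purely horizontal frequencies (vertical patterns), coefficients in column 0 as purely vertical frequencies (horizontal patterns), and all others as mixed. The fallback to four or eight neighbors when the two source blocks differ too much is left out, and the function name is hypothetical.

```python
def conceal_ac(up, down, left, right, n=8):
    """Interpolate the AC coefficients of a lost n x n block from four neighbors.

    Each argument is an n x n list of DCT coefficients indexed [row][col].
    The DC position [0][0] is left at zero; it is concealed separately.
    """
    out = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if r == 0 and c == 0:
                continue  # DC coefficient, handled by the least delta method
            if r == 0:
                # Vertical patterns continue into the blocks above and below.
                out[r][c] = (up[r][c] + down[r][c]) / 2
            elif c == 0:
                # Horizontal patterns continue into the blocks left and right.
                out[r][c] = (left[r][c] + right[r][c]) / 2
            else:
                # Mixed frequencies: simple four-way interpolation.
                out[r][c] = (up[r][c] + down[r][c] + left[r][c] + right[r][c]) / 4
    return out
```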

4.4 Measurements 

Numerical results that quantify the quality difference visible in Figs. 2-8 are
given in Table 1. All values are relative to the original web-quality image shown
in Fig. 1 and were computed for grayscale versions of the images. The peak
signal-to-noise ratio (PSNR) is a mathematically derived value often used for
image comparisons. In addition to the PSNR a perceptual metric based on
just-noticeable-difference (JND) [?] is also presented. For PSNR high values are
desirable whereas for JND low values are better. The values are image-dependent
and other images will produce different results. In order to fully evaluate the
performance of the proposed methods more comprehensive tests must be performed.

Table 1. Error concealment performance

Image   Description                                         PSNR     JND
Fig. 2  10% loss                                            12.437   243.487
Fig. 3  10% loss, JPEG regular restart                      19.066   159.023
Fig. 4  10% loss, extended restart                          21.802   134.521
Fig. 5  10% loss, maximum spreading interleaved             23.022   118.748
Fig. 6  10% loss, maximum spreading interleaved concealed   28.888    32.254
Fig. 7  10% loss, random interleaved                        22.138   128.095
Fig. 8  10% loss, random interleaved concealed              29.733    52.807


5 Conclusions 

By coupling a partially reliable transport protocol to suitable image coding,
improved image transfer performance can be achieved in lossy networks. The
partially reliable transport protocol guarantees that at least a specified fraction of
the data is delivered to the application, and hence a lowest acceptable image
quality will be enforced. We have designed and implemented a modified JPEG
coder suitable for such applications. Some of the required modifications, such as
random window interleaving, are new, while others are adaptations of known
techniques. Our recoding results show that JPEG can be adapted to handle packet
losses in a graceful way. For future work we intend to integrate a compressing 
text coder capable of handling a partially reliable service with the JPEG coder 
and a proxy system.

Abstract. As technological advances continue to be made, the demand for more
efficient distributed multimedia systems is also affirmed. Current support for 
end-to-end QoS is still limited; consequently mechanisms are required to 
provide flexibility in resource loading. One such mechanism, caching, may be 
introduced both in the end- system and network to facilitate intelligent load 
balancing and resource management. This paper introduces new work at 
Lancaster University investigating the use of transparent network caches for 
MPEG-2. A novel architecture is proposed, based on router-based caching and 
the employment of large scale dynamic RAM as the sole caching medium. 
Finally, the architecture also proposes the use of the ISO/IEC standardised 
DSM-CC protocol as a basic control infrastructure and the caching of pre-built 
transport packets (UDP/IP) in the data plane. The work presented in this paper 
is in its infancy and consequently focuses upon the design and implementation 
of the caching architecture rather than an investigation into performance gains, 
which we intend to report in future publications. 



1 Introduction 

The delivery of real time continuous media such as digital audio and video is 
becoming increasingly important in today’s computing environment. However, the 
high data rates and strict delivery constraints that continuous media impose have
proven to be difficult to meet in high demand situations. A wealth of research has 
been carried out over the past ten years to solve these problems, combining 
developments in efficient filing systems, highly optimised scheduling policies, 
admission control and resource management [1]. Whilst research has led to high 
performance servers, there are still complex issues surrounding the end-to-end 
delivery of audio and video across the Internet. The large data size not only places a 
substantial load on the network, but also represents a high cost for video distribution, 
particularly if expensive backbone links are involved in the delivery process. One 
approach to reducing this cost is the use of caching. By inserting cache nodes in the 
local network, popular videos or clips can be serviced from local caches, resulting in a 
reduced network load over the greater distance. Caching also brings additional 
benefits, in that videos streamed from the cache decrease the load on the server,
allowing it to service other requests. As caches are located close to clients, there is
also reduced latency in establishing connection set-ups. 

Advances in hardware have brought about increased dynamic RAM capacities at a 
reduced cost; a pattern that is expected to continue for years to come. Coupled with 
the move to 64-bit architectures, there is now more potential than ever to use RAM as 
a medium for video storage. RAM brings many advantages to the real-time delivery 
of continuous media; not least the elimination of disk I/O bottlenecks and increased 
overall bandwidth, which offer the potential to support a very large number of 
concurrent accesses. When serving a large number of streams, it is suggested that 
RAM provides a more economic solution than disks due to its higher bandwidth 
capabilities [2].

In this paper we present a cache node using main memory as the primary caching 
medium. In order to maximise throughput performance between RAM and the 
network interfaces, the node uses pre-built IP, which stores video data in main 
memory in a pre-packetised network/transport format, to be inserted into the IP stack 
with minimal processing overhead. Much of the implementation focuses upon the use 
of MPEG-2 as a streaming media type. This is the generally accepted format for 
broadcast quality digital video and is generic in its deployment capabilities. The paper 
is divided into the following sections. Section 2 gives an overview of the caching 
system from an architectural perspective, describing envisaged deployment scenarios 
and integration into the control environment. Continuing, section 3 discusses some of 
the implementation aspects and the realisation of the caching node in Windows 2000. 
Finally, sections 4 and 5 overview some related work in the area and outline some 
directions of interest for future work. 



2 Caching Architecture 

This section provides an overview of the caching architecture, describing envisaged 
deployment scenarios and integration into the control and data environment. 



2.1 Transparent Caching 

Traditional web caching is achieved through the use of a proxy server, which is 
physically located between a client web browser and a server. The proxy intercepts 
all packets, and examines each one in order to determine whether it can service the 
request itself, or whether an additional request has to be made to the server. Proxies 
generally need to be explicitly configured within the web browser for each client, 
which presents a large cost and an unscalable solution for service providers. 

A fundamental objective of the caching node is that it should be completely 
transparent to both video client and server so that no additional cost of ownership is 
incurred, hence making the node suitable for wide-scale deployment. A number of 
approaches to transparent caching were considered, the most common of which use 
L4 switches or policy-based routers. In this case, user requests for web pages are 
diverted by a router/switch to a local cache, and all other network traffic forwarded as 
normal. If the requested data is not available in the local cache, a separate TCP 
connection is established to the web server in order to retrieve (and store) it. The data
is then returned to the client, with any subsequent requests for the same data serviced
from the local storage of the cache. 

Our proposed approach includes IP routing functionality within the cache node itself 
(bearing in mind the cache node is generally positioned at the edge of the network and 
not in the core). This ensures that all requests pass through the node, enabling it to 
filter out requests and cache incoming data. Both the server and client are unaware of 
the presence of the cache, and there are no configuration changes required within the 
router in order to forward requests to the cache. Cache nodes are deployed within the 
local area network close to clients, and do not perform the intensive routing 
operations associated with core routers. 



2.2 Topology 

The basic networking architecture is shown in Fig. 1. A continuous media server that 
can support a number of concurrent stream accesses is connected via a high-speed 
interconnect to a backbone router. The cache node is installed in each LAN, acting as 
an IP router and is directly attached to the backbone router connecting the LAN to the 
WAN. The proposed architecture uses UDP/IP over an ATM-based IntServ/DiffServ 
infrastructure, and assumes negligible packet loss on such connections. We believe 
that UDP is the preferable choice over TCP since it does not provide error correction 
and control, and is connection-less and therefore ideally suited to interception and 
masquerading. 



Fig. 1. Network Topology

The caches are able to use pre-built transport units, in this case UDP packets, as a unit 
of caching. This avoids both the need for user level processing and the need for re- 
assembly of data leaving the node. Because of the simplistic and connection-less 
nature of UDP, packets can be easily sent from the cache node in a form of 
masquerading whereby the client is unable to determine the origin of the data and
thus the caching appears to be transparent.

2.3 Integration with Multimedia Data Architecture (MPEG-2)

The MPEG-2 (Moving Pictures Experts Group) ISO/IEC 13818 standard [3] is 
designed to provide high quality video by exploiting spatial and temporal 
redundancies in the input source in order to achieve compression. MPEG-2 has been 
widely adopted as the standard for digital video, and is used by DVB (Digital Video 
Broadcast), DAVIC and DVD. 

The MPEG-2 systems layer (ISO/IEC 13818-1) specifies two mechanisms for the 
multiplexing and synchronisation of elementary video and audio streams to form a 
single data stream that is suitable for storage or transmission. Each mechanism is 
tailored for a different operating environment. The first scheme, known as a Program
Stream, uses large packets of variable sizes and is designed for use in a largely
error-free environment. The scheme is similar to the MPEG-1 multiplexing standard
and can support only one program (a number of elementary video/audio streams with
a common time base). The second scheme, known as a Transport Stream, can
combine multiple programs with independent time bases into a single stream.
Transport Streams use fixed length 188-byte packets, with additional error protection
and incorporate timestamps within the packets to ensure correct synchronisation.
They are intended for use in error-prone environments. However, because of their
additional complexity, Transport Streams are more difficult to create and de-multiplex
than Program Streams, and therefore the initial caching architecture focuses upon the
handling of Program Streams only. 

Program Streams are constructed from one or more Packetised Elementary Stream 
(PES) packets. A PES packet consists of a header and a payload. The header 
contains important timing information in the form of a Presentation Time Stamp 
(PTS) and a Decoding Time Stamp (DTS), which are used to ensure correct 
synchronisation at the decoder. The header also contains a stream_id field in order to 
distinguish one elementary stream from another within the same program. The 
payload consists of the video and audio data bytes that have been encoded from the 
original source stream. In a Program Stream, PES packets are arranged into logical 
groups known as ‘packs’. A pack consists of a pack header, an optional system 
header and any number of PES-packets. The pack header also contains important 
timing information in the form of a System Clock Reference (SCR). 
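
As an illustration of the PES header fields described above, the following minimal Python sketch extracts the stream_id and PTS from a PES packet. It assumes that the MPEG-2 optional PES header is present (as is the case for audio and video streams) and that the PTS uses the 5-byte, marker-bit-separated layout of ISO/IEC 13818-1. The helper names (encode_pts, decode_pts, build_pes, parse_pes_header) are hypothetical and are not part of the prototype.

```python
import struct

PES_START_PREFIX = b"\x00\x00\x01"


def encode_pts(pts: int, prefix: int = 0b0010) -> bytes:
    """Pack a 33-bit PTS into the 5-byte field, inserting the marker bits."""
    pts &= (1 << 33) - 1
    return bytes([
        (prefix << 4) | (((pts >> 30) & 0x07) << 1) | 1,  # PTS[32..30]
        (pts >> 22) & 0xFF,                               # PTS[29..22]
        (((pts >> 15) & 0x7F) << 1) | 1,                  # PTS[21..15]
        (pts >> 7) & 0xFF,                                # PTS[14..7]
        ((pts & 0x7F) << 1) | 1,                          # PTS[6..0]
    ])


def decode_pts(field: bytes) -> int:
    """Inverse of encode_pts: drop the marker bits and rebuild the PTS."""
    return (((field[0] >> 1) & 0x07) << 30
            | field[1] << 22
            | (field[2] >> 1) << 15
            | field[3] << 7
            | field[4] >> 1)


def build_pes(stream_id: int, pts: int, payload: bytes) -> bytes:
    """Build a PES packet carrying a PTS only (illustrative test data)."""
    # 0x80: '10' marker bits; 0x80: PTS_DTS_flags = '10'; 5: header data length
    optional_header = bytes([0x80, 0x80, 5]) + encode_pts(pts)
    length = len(optional_header) + len(payload)
    return (PES_START_PREFIX + bytes([stream_id])
            + struct.pack(">H", length) + optional_header + payload)


def parse_pes_header(data: bytes):
    """Return (stream_id, PES_packet_length, PTS or None)."""
    if data[:3] != PES_START_PREFIX:
        raise ValueError("missing PES start code prefix")
    stream_id = data[3]
    (length,) = struct.unpack(">H", data[4:6])
    pts_dts_flags = data[7] >> 6
    pts = decode_pts(data[9:14]) if pts_dts_flags & 0b10 else None
    return stream_id, length, pts
```

For example, a video PES packet (stream_id 0xE0) built with a PTS of 270000 parses back to the same value, which corresponds to three seconds at the 90 kHz clock used for PTS values.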



2.4 Integration with Multimedia Control Architecture (DSM-CC) 

In designing a caching architecture, integration into the media control architecture is 
essential. In this work we have adopted the ISO standardised Digital Storage Media - 
Command and Control (DSM-CC) protocol [4]. This protocol is a specific 
application protocol, intended to provide the basic control functions and operations to 
manage digital storage bit streams akin to MPEG-2. It is designed for the command 
and control of retrieval/storage applications such as video-on-demand, interactive 
video services and electronic publishing. In the proposed caching architecture, 
DSM-CC is used in stand-alone mode (i.e., it is not embedded within the data stream). It is 
encapsulated in PES packets and transmitted over UDP/IP. To initiate simple 
playback, a client issues a bit stream select command to a given DSM server. This 
select command carries a bit stream identifier corresponding to an ISO/IEC 13818 
stream. Providing the server holds the appropriate stream, a select acknowledgement 
is returned (to the sender’s UDP port). The client can now control the stream (play, 
stop, pause, resume and jump) through subsequent retrieve commands. 

Within the caching topology, the cache server intercepts DSM-CC requests. If it is 
able to provide the desired bit stream from cache, it masquerades as the server 
itself (this is possible because of the simplicity of the control architecture); otherwise 
the DSM command is forwarded to the appropriate server. In some scenarios the 
cache may be able to provide a portion of the bit stream (termed a partial hit), in 
which case it is necessary for the cache to send a request to the server itself and then 
‘hand over’ at the appropriate time. Providing that this request is made at the 
appropriate time, the retrieved stream can be optionally cached and then forwarded. 
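
The hit, partial hit and miss cases described above can be sketched as a small decision function. This is only an illustrative model under stated assumptions: the cache is represented as a dictionary from bit stream identifier to the set of cached seconds, and the names Action and decide are hypothetical rather than taken from the prototype.

```python
from enum import Enum


class Action(Enum):
    SERVE_FROM_CACHE = "serve_from_cache"          # full hit: masquerade as server
    SERVE_THEN_HAND_OVER = "serve_then_hand_over"  # partial hit
    FORWARD_TO_SERVER = "forward_to_server"        # miss


def decide(cache, bitstream_id, start, duration):
    """Return (action, hand_over_second) for a request of `duration` seconds.

    `cache` maps a bit stream identifier to the set of seconds held in cache.
    For a partial hit, hand_over_second is the first second that must come
    from the origin server; otherwise it is None.
    """
    cached = cache.get(bitstream_id, set())
    run = 0  # number of consecutive cached seconds from the start position
    for second in range(start, start + duration):
        if second not in cached:
            break
        run += 1
    if run == duration:
        return Action.SERVE_FROM_CACHE, None
    if run > 0:
        return Action.SERVE_THEN_HAND_OVER, start + run
    return Action.FORWARD_TO_SERVER, None
```

In the partial hit case the returned second indicates the point by which the cache's own request to the server must have produced data, so that the hand over happens without a gap in playback.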

To determine which UDP packets make up a given bit stream, the caching node must 
assume that all PES packets originating from the DSM server, to a given port and 
address, constitute the same bit stream until an MPEG-2 end code is received. It is 
therefore necessary for the cache to ‘snoop’ the UDP payload during the caching 
process. Where IPv6 is deployable, the cache is able to use the flow label instead, in 
which case the need for snooping is avoided. 
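
A minimal sketch of this snooping rule is given below. It groups UDP payloads by (source address, destination address, destination port) and closes a bit stream when the MPEG-2 program end code (0x000001B9) is seen. The StreamTracker class is hypothetical, and the sketch makes the simplifying assumption that the end code never straddles two UDP payloads.

```python
PROGRAM_END_CODE = b"\x00\x00\x01\xb9"  # MPEG_program_end_code


class StreamTracker:
    """Group snooped UDP payloads into bit streams (illustrative only)."""

    def __init__(self):
        self.active = {}      # (src, dst, dst_port) -> list of payloads
        self.completed = []   # list of (key, payloads) for finished streams

    def on_udp_packet(self, src, dst, dst_port, payload):
        """Record a payload; return True if it completed a bit stream."""
        key = (src, dst, dst_port)
        self.active.setdefault(key, []).append(payload)
        if PROGRAM_END_CODE in payload:
            self.completed.append((key, self.active.pop(key)))
            return True
        return False
```

With an IPv6 flow label, the key could instead be taken from the IP header alone, which is why the payload would no longer need to be inspected.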



2.5 Caching Behaviour 

Because of the large size of MPEG-2 video objects (typically in the order of a number 
of megabits per second, per film), it is intended that the node cache complete video 
clips and popular portions of large video objects. The behaviour of the cache is 
controlled by a set of policies that dictate what to do in the event of a cache miss, a 
cache hit or a lack of available cache memory. A range of cache replacement policies 
is under consideration for use within the caching architecture. These include 
strategies from traditional caching research such as LRU (Least Recently Used), LFU 
(Least Frequently Used) and LRU-k [5]. However, it is clear from an analysis of 
logged accesses to video data [6] that the initial portion of a video is often used to 
determine a user’s interest. The importance of a replacement policy in maintaining the 
initial portion of a video within the cache is also affirmed in [7], which proposes a prefix 
caching technique that stores the initial frames of popular clips. It is suggested that 
this hides any latency, throughput and loss effects between the server and the proxy, 
and if combined with work ahead smoothing can reduce the variability of network 
resources between the proxy and client. 
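
To make the interaction between a traditional policy and prefix retention concrete, the sketch below combines LRU replacement with protection of the first few seconds of each stream. It is not one of the policies evaluated in this work; the PrefixLRUCache class, its parameters and the (stream identifier, second) key format are assumptions made for illustration.

```python
from collections import OrderedDict


class PrefixLRUCache:
    """LRU replacement that evicts prefix blocks only as a last resort."""

    def __init__(self, capacity, prefix_seconds):
        self.capacity = capacity              # maximum number of blocks
        self.prefix_seconds = prefix_seconds  # seconds of each stream to protect
        self.blocks = OrderedDict()           # (stream_id, second) -> block

    def get(self, key):
        if key not in self.blocks:
            return None
        self.blocks.move_to_end(key)          # mark as most recently used
        return self.blocks[key]

    def put(self, key, block):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        self.blocks[key] = block
        while len(self.blocks) > self.capacity:
            self._evict()

    def _evict(self):
        # Evict the least recently used block that is not part of a prefix.
        for key in self.blocks:
            if key[1] >= self.prefix_seconds:
                del self.blocks[key]
                return
        # Only prefix blocks remain: fall back to plain LRU.
        self.blocks.popitem(last=False)
```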



3 Design and Implementation 

This section presents an overview of the engineering and implementation of the 
caching architecture. A significant contribution of this work is in the realisation of 
the system within a prototype environment. Although the prototype is still in its 
infancy, already the design and implementation have provided a number of additional 
insights into the problems faced in network caching strategies. 

The prototype cache node is based on extensions to the Microsoft Windows 2000 
Advanced Server operating system and the hardware consists of an Intel L440GX+ 
motherboard, which is specifically designed for enterprise servers. The existing 
prototype supports 2Gb of main memory (72bit registered ECC DIMMs). In terms of 
network interfaces, the L440GX+ motherboard incorporates an Intel 82559 Fast 
Ethernet (100Mbps) adapter. This is coupled with an additional Intel PRO/100B PCI 
Ethernet adapter providing a secondary routing interface. Finally, the cache node’s 
processing power is provided by dual Intel Pentium III 450MHz processors. 

In terms of client/server networking infrastructure, the prototype cache node is 
connected as a gateway router between two private IP sub-networks. Consequently, 
communications between given client/server nodes are forced to pass through the 
cache. The video distribution application used within the experimental environment 
is a proprietary application providing simple MPEG-2 streaming over UDP/IP. The 
application streams according to in-line transmission rates (pack header 
program_mux_rate), assembling PES packets into transport payloads. The server 
application is not designed for serving a large client base and does not employ 
techniques such as striping, commonly found in commercial video distribution 
applications. 

At the client side, the payload from the UDP packets is passed from the socket layer 
into a Creative Labs Dxr2 MPEG-2 decoder card, supported by a Linux 2.2 open 
source driver. This client is able to decode and render program streams to both 
overlay and external analogue output. 



3.1 Kernel Mode Processing 

The Windows 2000 operating system is a micro-kernel architecture, based on ideas 
originally founded within the Mach operating system from Carnegie Mellon 
University [8]. This design means that privileged instructions, which are potentially 
dangerous to system stability, are restricted to a very small subset of functions that 
reside within the kernel. Furthermore, access to these is strictly controlled by the 
privilege level of the given process or application. In the Intel Processor Architecture [9] 
this protection is managed by one of four privilege levels (also known as ring 0 - 
ring 3). These determine which code and data segments are accessible by the calling 
program, and thus what instruction sets may be accessed. Conventionally, ring 0 
represents the most privileged level, reserved for kernel routines for memory control, 
error handling, task switching etc. Rings 1-3 are reserved for drivers, operating 
system extensions and applications respectively. However, the Windows 2000 
architecture only uses ring 0 for the operating system and ring 3 for applications; rings 
1 and 2 are not used. 

In the Intel implementation of Windows 2000, system calls to kernel services (also 
known as system services) are dispatched through software interrupts, invoked by the 
int2E instruction. This causes a system trap which allows the executing thread to 
transition into kernel mode and execute the respective trap handler. When this 
interrupt occurs, the processor first examines the appropriate descriptor (interrupt 
gate) in the Interrupt Descriptor Table (IDT), and then pushes onto the stack the flags 
register, current code segment and current instruction pointer. Providing that the 
protection constraints of the caller and the gate are agreeable, the appropriate handler 
is called. The handler (in this case _KiSystemService) verifies and copies the user 
mode stack, and then uses an argument in the register EAX to index into the system 
service dispatch table and execute the respective service. On completion of the 
service, an IRET instruction is issued which pops the instruction pointer, code 
segment and flags, and then resumes execution in the caller code. 

This approach to software protection is effective, but does nevertheless incur some 
overhead (in the order of thousands of clock cycles) since each time a system call is 
made a context switch must occur. Furthermore, Windows 2000 does not use 
hardware multi-tasking (such as found on the Intel processors) and thus context 
switches are executed by software alone and are consequently processor intensive. To 
avoid this overhead, all processing by the cache node is executed within the kernel. 
No interactions with user mode code are made, and thus context switching is avoided. 
To achieve this engineering goal, the caching mechanisms are implemented as a 
kernel module (device driver) which interacts directly with the communications 
protocol stack. 



3.2 IP Interception and Injection 

The Windows 2000 networking support, from the lowest level, consists of a series of 
chained kernel device drivers for physical/network/transport layer processing, 
followed by a number of user level Winsock2 layers. The Winsock2 layers consist of 
the API itself (provided as ws2_32.dll) in conjunction with one or more network 
Service Provider (SP) layers. These interact with the kernel transport drivers, via the 
operating system’s I/O manager, to provide a proprietary user level transport API to 
the Winsock layer. Generally these layers do not provide any of the core 
implementation of any individual protocol. They do however provide functionality 
that enables the services offered by the kernel drivers to be ‘projected’ into the user 
space, including support for asynchronous and shared I/O. 

As previously discussed in Section 2, the cache node is designed to cache pre-built IP 
packets. Because the core IP protocol processing is provided by the kernel, we are 
able to execute the necessary caching interactions without entering the upper layers. 
To intercept packets, a hook routine is attached to the TCP/IP driver (tcpip.sys). Once 
an IP packet has been assembled from NDIS packets, it is passed to the hook routine 
which then determines, in accordance with caching policies, whether or not to cache the 
packet. This process is carried out on all packets passing through the node. Packets 
which are being forwarded by the node, which in fact make up the majority, are 
passed to the outgoing interface in the form of the NDIS buffers used to assemble the 
original packet. This optimisation eliminates the need to re-fragment into network 
transport units. In theory it would be possible to re-write the network adapter so that 
DMA is used to place data directly into the caching area (this would also require an 
adapter capable of 64bit addressing) and hence avoid the need to carry out any data 
copying (zero-copy). However, the prototype implementation does not support this 
and executes exactly one copy on all cached packets, which is made during the 
interception, rather than injection, process. 

In the reverse direction, packets which have been previously cached must be passed 
back out onto the network as required. This process of ‘injecting’ packets into the 
network is also done solely in the kernel. Before cached packets can be sent back out 
into the network a process of adaptation or ‘moulding’ must take place. The primary 
purpose of this is re-addressing and the adjustment of any checksum fields. New 
address fields for the IP packets are determined from the intercepted DSM-CC 
packets. The initial prototype implementation uses IPv4, whilst adoption of IPv6 is 
underway. To support IPv4, checksums for both the IP header and the UDP header 
must be re-calculated from the adjusted fields, whilst IPv6 only requires re-calculation 
of the UDP checksum (since IPv6 headers do not have a checksum). 
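
The IPv4 ‘moulding’ step can be illustrated with the standard Internet checksum (the 16-bit one's complement sum used by both the IPv4 header and UDP). The sketch below re-addresses a cached IPv4/UDP packet and recomputes both checksums. It is written in Python for readability, whereas the prototype performs this work in a kernel driver; the function names are hypothetical, and the sketch assumes an unfragmented packet with no trailing bytes after the UDP datagram.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of `data`."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = sum(struct.unpack(f">{len(data) // 2}H", data))
    while total >> 16:                       # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def readdress_ipv4_udp(packet: bytes, new_src: str, new_dst: str) -> bytes:
    """Rewrite the IP addresses and recompute the IP and UDP checksums."""
    ihl = (packet[0] & 0x0F) * 4             # IP header length in bytes
    ip = bytearray(packet[:ihl])
    udp = bytearray(packet[ihl:])

    ip[12:16] = socket.inet_aton(new_src)
    ip[16:20] = socket.inet_aton(new_dst)
    ip[10:12] = b"\x00\x00"                  # zero the field before summing
    ip[10:12] = struct.pack(">H", internet_checksum(bytes(ip)))

    udp[6:8] = b"\x00\x00"
    # The UDP checksum covers a pseudo-header: addresses, zero, protocol, length
    pseudo = bytes(ip[12:20]) + struct.pack(">BBH", 0, 17, len(udp))
    checksum = internet_checksum(pseudo + bytes(udp)) or 0xFFFF
    udp[6:8] = struct.pack(">H", checksum)   # a computed zero is sent as 0xFFFF
    return bytes(ip) + bytes(udp)
```

A receiver validates either checksum by summing the covered bytes, including the checksum field, and expecting a result of zero. Because the UDP pseudo-header contains the IP addresses, re-addressing alone is enough to invalidate the UDP checksum, which is why it must be recomputed for IPv6 as well.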

In the case of IPv6, it is also necessary to adjust the packet’s flow label. This is a 28- 
bit value made up of a traffic class field (4 bits) and a flow identifier (24 bits). In the 
initial prototype implementation the flow identifier is randomly generated by the 
cache node and the traffic class fixed at zero. Nevertheless, better use of the traffic 
class and flow identifier is envisaged in future work. Possibilities include selecting 
network QoS according to the requirements of the cached video and using the flow 
identifier to differentiate across multiple streams being received on the same UDP 
port. 



3.3 Cache Entries/Indexing 

This section discusses the general internal organisation of the cache, and the adopted 
strategy for the storage and retrieval of cache blocks. An entry in the cache, known as 
a cache block, is made up of an informational header followed by a number of pre- 
built IP packets, made up of an IPv4 or IPv6 header and a series of MPEG PES 
packets within a UDP payload. Each cache block corresponds to one second of 
presentation time for a given program stream. This interval represents the minimal 
required granularity of media access (seeking to a finer granularity serves no useful 
purpose in Video-on-Demand scenarios). By caching data in blocks of one second, 
severe fragmentation is avoided and the scope required by the indexing scheme is 
reduced. 

Cache blocks are uniquely identified for storage and retrieval through a combination 
of bitstream_ID and presentation time (each 32 bits wide). The bitstream_ID 
uniquely identifies a given media stream and is assigned by the DSM-CC server 
which maps them to more meaningful names. The assumption is made that within the 
context of a distributed server architecture, identifiers are co-ordinated accordingly 
and are globally unique to a given stream content. The presentation time, in units of a 
second, is derived from the Presentation Time Stamp (PTS) of the first PES packet in 
the first IP packet of the cache block. In some streams PES packet PTSs may not be 
sequential. This is because different elementary streams, within a program stream, 
may have different decoding latencies, and therefore the presentation-decode time 
mappings are different. To avoid this problem, cache blocks are stored in relation to 
the PTS fields of a specific elementary stream which acts as a reference stream for 
time stamp information. Finally, within the MPEG-2 Systems specification, the PTS 
field is defined as optional, and therefore in its absence (across all elementary streams 
in the program), time stamps are generated by the cache node using either the 
program_mux_rate from the pack header or the local time at which the block is 
cached. 
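
The identification scheme above can be sketched as follows. The sketch assumes that PTS values count ticks of the 90 kHz MPEG-2 system clock and that program_mux_rate is expressed in units of 50 bytes per second, as defined in ISO/IEC 13818-1. Packing the two 32-bit values into a single 64-bit key, and the function names used, are illustrative choices rather than details of the prototype.

```python
PTS_CLOCK_HZ = 90_000        # PTS values count ticks of a 90 kHz clock
MUX_RATE_UNIT_BYTES = 50     # program_mux_rate is in units of 50 bytes/second


def cache_block_key(bitstream_id: int, pts: int) -> int:
    """Combine a 32-bit bitstream_ID and a 32-bit presentation second."""
    second = pts // PTS_CLOCK_HZ
    return ((bitstream_id & 0xFFFFFFFF) << 32) | (second & 0xFFFFFFFF)


def split_key(key: int):
    """Recover (bitstream_ID, presentation second) from a block key."""
    return key >> 32, key & 0xFFFFFFFF


def second_from_mux_rate(byte_offset: int, program_mux_rate: int) -> int:
    """Estimate the presentation second when no PTS is present.

    byte_offset is the number of stream bytes seen before this block.
    """
    return byte_offset // (program_mux_rate * MUX_RATE_UNIT_BYTES)
```

For instance, a stream multiplexed at 500,000 bytes per second has a program_mux_rate of 10000, so a block starting 1,500,000 bytes into the stream would be assigned presentation second 3.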

The proposed caching architecture uses a two level indexing scheme to hold the 
mappings between the bitstream_ID/presentation time and the physical address of the 
cache block. The scheme is strikingly similar to that found in hardware virtual 
memory support such as found in the Intel Architectures. The first index, termed the 
cache index, provides a mapping between the stream identifier and the address 
(virtual) of the cache table. This index has n entries, where n is the total number of 
unique streams distributed by the server(s). Each cache table, per entry in the cache 
index, consists of an array of physical addresses corresponding to each cache block 
stored in the cache area for the given stream. An overview of the indexing scheme is 
shown in Fig. 2. The overhead of the indexing scheme is given as: 



$$\text{total\_size\_in\_bytes} = (8 \cdot n) + \sum_{i=1}^{n} (8 \cdot t_i)$$

where $n$ is the number of unique streams and $t_i$ is the total presentation time, in seconds, of stream $i$. 

Fig. 2. Cache Indexing Scheme 

Because the scheme provides direct mappings, the resources required by the indices 
may potentially be large. Nevertheless, we make the assumption that the 
limiting factor is the total number of unique streams concurrently handled by the 
system (in a more dynamic media environment, where content is frequently changed 
and therefore a large number of different streams may exist, a process of re-mapping 
is required). Consequently, the cache node need only maintain a cache index capable 
of addressing the total number of unique streams that may exist in the cache at any 
given time. Furthermore, hardware and cost limitations mean that this is constrained 
by the ‘capacity’ of the node. Our initial prototype provides ~2Gb of caching 
memory which, assuming an average stream rate of 4Mbps, is capable of storing 
approximately 68 minutes (4096 seconds) of media. Thus the maximum total overhead 
of the cache tables is only 32Kb. 
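
The two level scheme and the figures quoted above can be checked with a short sketch. It models the cache index as a Python dictionary and each cache table as a list indexed by presentation second, which is a simplification of the direct-mapped arrays described in the text. The class and function names are hypothetical, and the arithmetic assumes binary multiples (2Gb taken as 2 x 2^30 bytes, 4Mbps as 4 x 2^20 bits per second), which is the interpretation under which the quoted 4096 seconds is obtained.

```python
ENTRY_BYTES = 8  # size of one index entry, as in the overhead formula


class TwoLevelIndex:
    """Cache index (stream -> cache table) over per-second block addresses."""

    def __init__(self):
        self.cache_index = {}  # bitstream_ID -> list of block addresses

    def insert(self, bitstream_id, second, address):
        table = self.cache_index.setdefault(bitstream_id, [])
        if second >= len(table):
            table.extend([None] * (second + 1 - len(table)))
        table[second] = address

    def lookup(self, bitstream_id, second):
        table = self.cache_index.get(bitstream_id)
        if table is None or second >= len(table):
            return None
        return table[second]

    def overhead_bytes(self):
        # (8 * n) for the cache index plus 8 bytes per cache table entry
        n = len(self.cache_index)
        return ENTRY_BYTES * n + sum(
            ENTRY_BYTES * len(table) for table in self.cache_index.values())


def capacity_seconds(cache_bytes, stream_bits_per_second):
    """Seconds of media that fit in the cache at a given average bit rate."""
    return cache_bytes * 8 // stream_bits_per_second


seconds = capacity_seconds(2 * 2**30, 4 * 2**20)  # 4096 seconds
table_overhead = ENTRY_BYTES * seconds            # 32768 bytes
```

Under these assumptions the cache holds 4096 one-second blocks, so the cache tables need at most 4096 entries of 8 bytes each, i.e. 32768 bytes, in agreement with the overhead stated above.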

As mentioned, the proposed architecture is similar to virtual memory mapping 
techniques, offering a minimal direct mapping approach. In further development, we 
envisage the replacement of the soft indexing scheme with 64 bit hardware virtual 
memory management. An alternative to the chosen approach could be to use hash 
tables. However, because the required indexing scope (number of unique streams and 
total length of cached segments) varies significantly, the use of hash tables would 
result in substantial redundancy. 



3.4 Caching Memory Sub-system 

The nature of the caching architecture and its demands on memory management are 
significantly different from the requirements found in general purpose operating 
systems. Allocations, corresponding to individual cache blocks, are relatively large 
(ranging from 256K - 1024K). Blocks may exist outside the range of the virtual 
address space, at least within 32-bit architectures, which only provide 2^32 virtual 
addresses (4Gb). Furthermore, this problem is exacerbated by the operating system’s 
use of reserved areas, which within the Windows 2000 operating system only leaves 
somewhere in the region of 512Mb of virtual addresses available for kernel modules. 
In order to fulfil these requirements, the cache node uses a memory sub-system to 
manage the caching area. The prototype memory sub-system includes the following 
features: 

• Large Scale Addressing - support for addressing beyond 4Gb of physical 
memory. The prototype is based on the 32-bit Intel Pentium III architecture 
which includes support for the Physical Address Extension (PAE). PAE 
enables an extension of physical address from 32 bits to 36 bits, thus 
allowing up to 64Gb of main memory. It is supported in the linear address 
translation scheme through the incorporation of an additional indexing table, 
the page-directory-pointer table. This table provides support for up to 4 page 
directories. 

• Dynamic Address Mapping - as a side effect of large scale addressing there 
are generally fewer virtual addresses available than accessible physical 
addresses. Consequently, the proposed sub-system maps virtual addresses ‘on 
demand’. This means that when the caching engine wishes to access the 
caching area, to store or locate a block, the sub-system must temporarily map 
a virtual address to the cache block for the duration of the required access. 
Once the engine has finished with a block the virtual address is released. 
The mapping process can also be optimised so that virtual addresses are only 
released when there are no more available, although it is more efficient to 
release addresses as soon as possible. 

• Modular Sub-System - The memory sub-system is abstracted as a kernel 
module and offers a cleanly defined API. Future developments of the cache 
node are expected to support Intel’s 64-bit (IA64) architecture. The initial 
sub-system implementation uses Intel’s 36 bit addressing, with the intention 
of its seamless replacement with 64 bit support. 

• Lazy Defragmentation - To avoid fragmentation and thus redundancy 
occurring, the sub-system employs a low-priority worker thread to 
defragment the memory area when blocks are not being accessed. The cache 
blocks are defragmented according to a least recently used metric. 
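
The Dynamic Address Mapping feature listed above can be modelled as a small pool of virtual address ‘slots’ that are bound to physical cache blocks only for the duration of an access. The sketch below is a user level analogy in Python, not kernel code: the MappingWindow class and its slot numbering are hypothetical, and no real page tables are manipulated. It releases each slot as soon as the access completes, which corresponds to the eager release strategy.

```python
from contextlib import contextmanager

PAE_ADDRESS_BITS = 36
PAE_ADDRESSABLE_BYTES = 2 ** PAE_ADDRESS_BITS  # 64 x 2^30 bytes


class MappingWindow:
    """A fixed pool of virtual slots mapped to physical blocks on demand."""

    def __init__(self, slots):
        self.free_slots = list(range(slots))
        self.in_use = {}  # slot -> physical address of the mapped block

    @contextmanager
    def mapped(self, physical_address):
        """Temporarily map a block; the slot is released on exit."""
        if not self.free_slots:
            raise MemoryError("no virtual address slots available")
        slot = self.free_slots.pop()
        self.in_use[slot] = physical_address
        try:
            yield slot
        finally:
            del self.in_use[slot]
            self.free_slots.append(slot)
```

A caller would wrap each store or lookup in `with window.mapped(address) as slot:` so that a block residing above the 4Gb boundary occupies a virtual address only while it is being read or written.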

The memory sub-system uses a simple address ordered free list and a first fit policy 
[10]. Although a single free list is not as efficient as a segregated free list, it does 
provide ample performance for the prototype implementation. We also chose not to 
use a buddy memory management system due to its complexity in coalescing and its 
apparently poor fragmentation behaviour. 
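
For readers unfamiliar with the technique, an address ordered free list with a first fit policy can be sketched in a few lines. The FreeList class below is an illustrative model operating on abstract offsets rather than on the prototype's physical memory. Free regions are kept sorted by start address, allocation takes the first region that is large enough, and released regions are coalesced with adjacent free neighbours.

```python
import bisect


class FreeList:
    """Address ordered free list with first fit allocation."""

    def __init__(self, size):
        self.free = [(0, size)]  # sorted list of (start, length) regions

    def allocate(self, length):
        """Return the start of the first region that fits, or None."""
        for i, (start, free_len) in enumerate(self.free):
            if free_len >= length:
                if free_len == length:
                    del self.free[i]
                else:
                    self.free[i] = (start + length, free_len - length)
                return start
        return None

    def release(self, start, length):
        """Return a region to the list, merging with adjacent free regions."""
        i = bisect.bisect_left(self.free, (start, length))
        self.free.insert(i, (start, length))
        # Coalesce with the following region if they touch
        if i + 1 < len(self.free) and start + length == self.free[i + 1][0]:
            self.free[i] = (start, length + self.free[i + 1][1])
            del self.free[i + 1]
        # Coalesce with the preceding region if they touch
        if i > 0 and sum(self.free[i - 1]) == self.free[i][0]:
            prev_start, prev_len = self.free[i - 1]
            self.free[i - 1] = (prev_start, prev_len + self.free[i][1])
            del self.free[i]
```

Because the list is address ordered, coalescing only ever needs to inspect the two neighbours of the released region, which keeps the release path simple.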



3.5 Caching Engine 

The cache engine is activated through the receipt of data on a network port, i.e. via a 
hardware interrupt, which instantiates a kernel thread. In turn this up-calls into the IP 
hook routine which then snoops the payload and caches accordingly. Traffic which 
does not contain PES packets and is not part of a stream being cached is forwarded out. 
Traffic which does contain PES packets, either DSM-CC or data, is passed into the 
caching engine. Here, DSM-CC requests are interpreted and the necessary 
masquerading/proxy actions performed. Alternatively, data PES packets are written 
into the cache according to the system’s caching policies. In the current 
implementation a policy is defined by an individual hook routine. We envisage future 
implementations supporting more dynamic policy specification and loading 
mechanisms. Other than the kernel threads generated by network interrupts, the 
current prototype implementation uses low priority worker threads to execute the 
de-fragmentation. When no fragmentation exists these of course sleep. Currently, the 
caching engine is executed solely in the kernel. Nevertheless, we do envisage the use 
of user-mode processes for general policy management and policy specification. 
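
The dispatch performed by the caching engine can be outlined as below. The sketch assumes that a payload beginning with the start code prefix 0x000001 carries PES data and that DSM-CC messages use the stream_id 0xF2 reserved for DSM-CC in ISO/IEC 13818-1. The classify function, the CachingEngine class and the policy hook signature are hypothetical, and the handling of DSM-CC requests is reduced to recording them.

```python
START_CODE_PREFIX = b"\x00\x00\x01"
DSMCC_STREAM_ID = 0xF2  # stream_id reserved for DSM-CC streams


def classify(payload, part_of_cached_stream=False):
    """Return 'dsmcc', 'data' or 'forward' for a snooped UDP payload."""
    if len(payload) > 3 and payload[:3] == START_CODE_PREFIX:
        return "dsmcc" if payload[3] == DSMCC_STREAM_ID else "data"
    return "data" if part_of_cached_stream else "forward"


class CachingEngine:
    """Dispatch snooped payloads to control handling, caching or forwarding."""

    def __init__(self, policy_hook):
        self.policy_hook = policy_hook  # callable(payload) -> bool
        self.control = []               # DSM-CC requests to be interpreted
        self.cached = []                # payloads copied into the cache
        self.forwarded = []             # payloads passed on unchanged

    def on_packet(self, payload, part_of_cached_stream=False):
        kind = classify(payload, part_of_cached_stream)
        if kind == "dsmcc":
            self.control.append(payload)
            return kind
        if kind == "data" and self.policy_hook(payload):
            self.cached.append(payload)  # the single copy made on interception
        self.forwarded.append(payload)
        return kind
```

Here a policy is simply a callable that decides whether a data payload should be copied into the cache, mirroring the description of a policy as an individual hook routine.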



4 Related Work 

Research into multimedia caching in the context of buffering and caching within 
multimedia servers exploits the fact that multimedia objects are typically accessed in a 
sequential manner. Blocks retrieved for one client can therefore be reused for 
subsequent requests within a short time interval. One technique to exploit this is 
interval caching, whereby intervals between successive streams (viewing the same 
data) are cached, in order that the subsequent stream avoids disk access [11, 12, 13]. 
Similarly distance caching [14] replaces blocks of data based on the distance between 
successive clients. 

Research into caching multimedia streams within the network stems from the work in 
managing a distributed hierarchical video-on-demand system. The Berkeley VOD 
System [15] is designed to provide transparent access to large amounts of video 
material. Continuous media objects are stored on tertiary storage devices, and only 
copied to a file server when required. The MiddleMan architecture [6] is a collection 
of cooperative proxy servers that, as an aggregate, cache video files within a local 
area network. The design incorporates a coordinator that is responsible for managing 
the video files stored at each proxy, controlling the storage and replacement of files 
and redirecting requests accordingly. 

In addition to minimising start up latency, caching multimedia streams can also 
smooth the playback of variable bit rate (VBR) video streams. Video streams exhibit 
burstiness due to the encoding scheme and variations within and between frames, 
which can be a problem in terms of buffer management and network utilisation 
(ensuring that a high level of utilisation is achieved). Proxy prefix caching [7] 
overcomes this by caching the initial frames of popular audio/video clips and 
performing work ahead smoothing of variable bit-rate streams in order to reduce the 
resource requirements from the proxy to the client. [16] also proposes a prefix 
caching scheme, but extends this with a selective caching mechanism that caches 
intermediate frames based on the encoding properties of the video stream and the 
user’s buffer size. It attempts to store frames within the cache that are more critical to 
maintaining the robustness of the stream. [17] describes a technique called video 
staging that pre-fetches and stores selected portions of VBR streams in a proxy. The 
aim is to reduce backbone bandwidth requirements by storing the bursty portions of a 
VBR stream within the proxy and combining them with a constant bit rate (CBR) 
video stream from the server for playout. [18] considers an end-to-end architecture 
for the delivery of layered-encoded streams in the Internet using proxy caches to 
smooth out variations in quality by pre-fetching segments that are missing from the 
cache. 

Much of the work on proxy cache replacement strategies is tailored to HTML 
documents and images, and does not consider the impact of multimedia streams [19]. 
However, [20] provides an investigation into multimedia streaming and cache 
replacement policies, introducing a caching algorithm based upon the resource 
requirements of an object. Finally, [21] provides one of the few investigations into 
the use of main memory for caching, using trace-driven simulations to evaluate the 
performance benefits of main memory caching for web documents. 



5 Further Work 

Work to date is focused on the use of large-scale RAM as a caching medium. It is 
proposed that this be extended to incorporate fast access disk as an additional level to 
a more hierarchical caching approach. However, the cache node’s disk storage will 
only be used as an intermediary before cache content is totally dropped. It will not be 
used as a source medium for streaming. 

Future work will also examine the implications of other media types, in particular 
ISO/IEC MPEG-4 [22]. This is a multimedia format that is object-oriented and lends 
itself to partial caching. We envisage using media ‘objects’ as a unit of caching, and 
the distributed gathering of such objects to form a scene. The adoption of such media 
types also opens up other areas of research interest. One such area is co-operative 
caching, whereby caching nodes within the network use some proprietary inter-nodal 
protocol to hand off cache requests to other nodes in the event of a local cache miss. 

6 Concluding Remarks 

This paper presents ongoing work examining the initial design and implementation of 
a network caching architecture for high quality video, and more specifically ISO’s 
MPEG-2. The discussed architecture is based on the notion of a transparent caching 
node, situated at the edge of the network, which is able to masquerade as a remote 
video server to its clients. This transparency lends itself to simple management and 
straightforward integration into the majority of existing network topologies. 

The cache node, based on the Windows 2000 operating system, exploits high-speed 
main memory to cache pre-built UDP/IP packets containing MPEG-2 Program Stream 
PES packets. This technique of using transport/network level data units avoids the 
transport layer re-fragmenting the video data, thus increasing performance. In order 
to deploy such a cache, it is essential that the cache is able to integrate and cooperate 
with the video control architecture. In our prototype implementation, we have chosen 
to support DSM-CC as the basic control protocol. The cache behaves as a DSM 
proxy to the unaware client, intercepting control requests and forwarding/handling 
these as necessary. 

The work presented is in its early stages and concrete evidence of its potential success 
is yet to be gained. Nevertheless, since its commencement, the work has highlighted 
some important issues in the design and implementation of such a scheme and further 
outlined the advantages which may be gained. In our next phase of work we hope to 
carry out more extensive testing of the prototype and give further indications of its 
effectiveness. We also wish to address some unresolved issues, such as what the 
constraints on caching policies are and how partial hits and other scenarios should be 
handled. 

Abstract. In this paper, we propose static and dynamic server selection 
techniques for multicast receivers who receive multiple streams from 
replicated servers. In the proposed static server selection technique, if (a) 
the location of servers and receivers and shortest paths between them on 
a network and (b) each receiver’s preference value for each content are 
given, the optimal server for each content that each receiver receives is 
decided so that the total sum of the preference values of the receivers 
is maximized. We use the integer linear programming (ILP) technique 
to make a decision. When we apply the static server selection technique 
for each new join/leave request to a multicast group issued by a 
receiver, it may cause server switchings at existing receivers and may take 
much time. In such a case, it is desirable to reduce both the number of 
server switchings and calculation time. Therefore, in the proposed 
dynamic server selection technique, the optimal server for each content that 
each receiver receives is also decided so that the total sum of the 
preference values is maximized, reducing the number of server switchings, by 
limiting both the number of receivers who may switch servers and the 
number of their alternative servers. Such restrictions also contribute to fast 
calculation in ILP problems. Through simulations, we have confirmed 
that our dynamic server selection technique achieves less than 10 % in 
calculation time, more than 90 % in the total sum of preference values, 
and less than 5 % in the number of switchings on large-scale hierarchical 
networks (100 nodes), compared with the static server selection. 



1 Introduction 

Multicast is a useful way for saving bandwidth consumption by simultaneous 
transmission of a data stream such as WWW pushing of contents and live video 
streaming to multiple receivers PEj. However, due to the limited bandwidth 
that can be used for multicast traffic, when multiple streams of live video are 
transferred, we need efficient bandwidth control of network resources used by 
each stream. For this purpose, we have proposed bandwidth control techniques to 
maximize the quality requirements of receivers for unicast and multicast streams 

Gil- 

Regardless of unicast or multicast communication, bandwidth shortage 
caused by multiple streams is due mainly to path length between servers and 
receivers, since competition occurs among different streams at some common 
bottleneck links. To overcome this problem, it may be useful to place some replicated 
servers at remote nodes [5] where video sources are transmitted to the 
replicated servers through high-speed links (backbone), and each receiver selects one of 
these servers depending on network traffic, server loads and so on. This 
technique improves network utilization without changing underlying routing protocols. 

In recent years, such multi-server techniques have been studied [?].
[?] have proposed unicast server selection techniques based on metric
information such as packet delay, hop count and server load. [?] has
proposed a multicast server selection technique in which the optimal
server assignment for each receiver, minimizing the total link cost, is
formulated as a mathematical problem on a graph. [?] also gives a
heuristic for the dynamic server selection problem that takes into account
the server switching cost caused by join/leave requests to multicast
groups. However, in multimedia applications using multicast communication,
such as video conferences among multiple locations, each receiver needs to
receive more than one stream, and his/her preference for each stream may
differ from those of others. In general, when a receiver receives a stream
that other receivers do not want and its path from the server is long, the
bandwidth used by that stream may reduce the benefit to the receivers as a
whole. Therefore, on networks where the available bandwidth is limited, it
is desirable to consider each user’s preference for each stream and to
maximize the benefit to the receivers as a whole.

Such an optimization can be computed statically if the set of receivers
and their preferences for all streams are known in advance. In general,
however, receivers issue join/leave requests repeatedly, and in such a
case re-optimization should be done dynamically. If we re-optimize for
every request, the following problems arise.

1. Calculation time: selecting servers is a combinatorial optimization
problem. Therefore, a large amount of calculation time may be required in
large-scale networks (we show the calculation time against the number of
nodes in Section 5.1).

2. Switching frequency: optimization may force existing receivers to
switch the current servers of the streams they are receiving to others,
even though they do not want the overhead caused by multicast join/leave
requests (join/leave latency and so on).

Therefore, it is desirable to apply the optimization technique only to the
part of the network where such dynamic changes happen, reducing the number
of server switchings at receivers while keeping the sum of the preference
values above a reasonable threshold.

In this paper, we propose static and dynamic server selection techniques
for multicast streams transferred from multiple replicated servers. In the
proposed static server selection technique, given several replicated
servers and each receiver’s preference value for each content, the optimal
server for each content that each receiver receives is decided so that the
total sum of the satisfied preference values is maximized. In the proposed
dynamic server selection technique, when a receiver issues a new
join/leave request to a multicast group, the number of server switchings
at existing receivers should be reduced, and the total sum of the
satisfied preference values should be increased. Therefore, in our dynamic
server selection technique, we limit both the receivers who may switch
servers and their alternative servers, so that the number of server
switchings is drastically reduced and the sum of preference values is
increased.

In our static optimization technique, for a network with given link
capacities, we take as input (1) the locations of servers and receivers,
(2) the shortest paths between them and (3) each receiver’s preference
value for each stream, and construct a logical conjunction of linear
inequalities that represent the bandwidth constraint on each link used by
streams. The objective function is set to maximize the total sum of the
preference values of all receivers for the streams. We then use integer
linear programming (ILP) to make an optimal server selection. Here, we
assume that streams with lower preference values may not be received in
case of bandwidth shortage.

In our dynamic optimization technique, for each join/leave request issued
by a receiver, we also construct linear inequalities in order to obtain a
solution in which the sum of the preference values is maximized while the
total number of server switchings is reduced. Here, to solve the target
ILP problem quickly, we add constraints that restrict the receivers who
may suffer server switchings, and that restrict their alternative servers
to the two servers whose multicast trees are the closest, among all the
servers, to the requesting receiver.

We have simulated our static and dynamic server selection techniques and
measured (a) the calculation time, (b) the sum of preference values, and
(c) the number of server switchings on a random topology and on a
hierarchical Internet topology called the Tiers topology [?], which
consists of LAN, MAN and WAN domains, assuming the SPF (Shortest Path
First) routing protocol. As a result, we have confirmed that, compared
with the static server selection, our dynamic server selection technique
needs less than 10 % of the calculation time, retains more than 90 % of
the sum of preference values, and causes less than 5 % of the server
switchings.

2 Preliminaries 

A network is modeled as an undirected graph with capacity $CAP(e)$ for
each link $e$. Replicated multicast servers (or just servers hereafter)
$S = \{s_1, \ldots, s_m\}$ and receivers $R = \{r_1, \ldots, r_n\}$ exist
on the network. Each server forwards contents $C = \{c_1, \ldots, c_p\}$
sent from source nodes to the receivers. From the receivers’ point of
view, therefore, each server can be regarded as a multicast server that
holds these contents. Each receiver can receive each content from one of
these servers, and assigns a value called a preference value to each
content. A preference value expresses how eagerly the receiver wants to
receive the content. An example of the network model is shown in Fig. 1.

Fig. 1. Network with Replicated Multicast Servers

In Fig. 1, source 1, source 2 and source 3 are the source nodes of
contents $c_1$, $c_2$ and $c_3$, respectively. These contents are
delivered to some of the replicated servers $s_1$, $s_2$, $s_3$ and $s_4$
through connections with large capacities. $s_1$, $s_2$, $s_3$ and $s_4$
are the candidate multicast servers, and each receiver selects one of them
to receive each content. In the figure, $r_1$ receives $c_2$ from $s_1$,
and both $c_1$ and $c_3$ from $s_2$.

In this paper, given (1) the locations of servers and receivers, (2) the
shortest path between each pair of server $s_i$ and receiver $r_j$ and
(3) the preference value of each receiver for each content, we formulate
the problem of deciding a server for each pair of a receiver and a content
so that the total sum of satisfied preference values is maximized. We call
this problem the static server selection problem. Note that a receiver may
not be able to receive all the contents that she/he requested, due to
bandwidth constraints.

Next we consider the case in which receivers dynamically start or stop
receiving contents, that is, receivers join/leave multicast groups. In
such a case, we can optimize the total sum of preference values by solving
the static server selection problem whenever a join/leave request is
issued. However, we then cannot avoid (a) the exponential growth of
computation time and (b) the overhead of server switching at almost all
receivers. Regarding (a), we ran simulations on networks with 100 nodes,
50 receivers, 10 servers and 5 contents; it took 200 seconds on average
and more than 400 seconds in the worst case to obtain an optimal solution
on an average machine (Pentium III, 500 MHz). Such durations are
acceptable if, for example, we design the total layout of multicast trees
before a new continuous service is started, but they are not feasible for
each small change of receiver status. Regarding (b), if we assume that the
overhead grows in proportion to the number of server switchings, it is
much better to reduce that number. For these reasons, in this paper we
propose another optimization technique, called dynamic server selection,
which is applied to each join or leave request of a receiver. In
Sections 3 and 4 we describe the static and dynamic server selection
problems, respectively.

Let us define the following terms.

- $path(s_i, r_j)$: the set of links on the shortest path from $s_i$ to
  $r_j$

- $uplink(e_l, s_i, r_j)$: the up-link (next to $e_l$) on
  $path(s_i, r_j)$

- $endlink(s_i, r_j)$: the bottom link (attached to $r_j$) on
  $path(s_i, r_j)$

- $pref(r_j, c_k)$: the preference value that $r_j$ gives to $c_k$

- $bw(c_k)$: the transmission rate of $c_k$

- $(s_i, c_k)$: the multicast group in which $c_k$ is delivered from $s_i$
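
The path-related terms above can be illustrated with a small sketch. It
assumes an unweighted graph in which the shortest path is the one with the
fewest hops, found by breadth-first search; the adjacency map `adj`, the
node names and the function names are hypothetical and are not taken from
the paper.

```python
from collections import deque

def shortest_path_links(adj, server, receiver):
    """Return path(server, receiver) as an ordered list of links,
    starting at the server side. A link is a frozenset of two nodes.
    Hop count as the path metric is an assumption of this sketch."""
    parent = {server: None}
    queue = deque([server])
    while queue:
        node = queue.popleft()
        if node == receiver:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    if receiver not in parent:
        raise ValueError("receiver is unreachable from server")
    links = []
    node = receiver
    while parent[node] is not None:
        links.append(frozenset((parent[node], node)))
        node = parent[node]
    links.reverse()
    return links

def endlink(path):
    """The bottom link of the path, i.e. the one attached to the receiver."""
    return path[-1]

def uplink(path, link):
    """The link next to `link` on the server side, or None for the top link."""
    idx = path.index(link)
    return path[idx - 1] if idx > 0 else None

# Hypothetical toy topology: s1 - a - b - r2, with r1 also attached to a.
adj = {"s1": ["a"], "a": ["s1", "b", "r1"], "b": ["a", "r2"],
       "r1": ["a"], "r2": ["b"]}
p = shortest_path_links(adj, "s1", "r2")
```

Representing a link as a `frozenset` of its two end nodes keeps the graph
undirected, in line with the network model of this section.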



3 Static Server Selection Problem 

In order to formulate the static server selection problem, we define the
following two types of boolean variables. Each variable of the first type
represents the fact that a receiver receives a content from a server. Each
variable of the second type represents the fact that the multicast tree of
a content from a server uses a link.

- $rcv[s_i, r_j, c_k]$: its value is one only if receiver $r_j$ receives
  content $c_k$ from $s_i$, otherwise zero.

- $deliver[e_l, s_i, c_k]$: its value is one only if content $c_k$ from
  server $s_i$ is delivered through link $e_l$, otherwise zero.

Using these variables, the static server selection problem can be formulated as 
the following integer linear programming (ILP) problem. 



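\begin{align}
\max \quad & \sum_{i} \sum_{j} \sum_{k} pref(r_j, c_k) \cdot rcv[s_i, r_j, c_k] \tag{1}
\end{align}

subject to:

\begin{align}
& \sum_{i} rcv[s_i, r_j, c_k] \le 1, \quad \forall j, k \tag{2} \\
& deliver[e_l, s_i, c_k] \le deliver[uplink(e_l, s_i, r_j), s_i, c_k], \quad \forall i, j, k, l \tag{3} \\
& rcv[s_i, r_j, c_k] \le deliver[endlink(s_i, r_j), s_i, c_k], \quad \forall i, j, k \tag{4} \\
& \sum_{i} \sum_{k} bw(c_k) \cdot deliver[e_l, s_i, c_k] \le CAP(e_l), \quad \forall l \tag{5}
\end{align}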



Objective function (1) represents the total sum of all receivers’
preference values. Constraint (2) states that one receiver selects at most
one server for each content. Constraints (3) and (4) concern the form of
multicast trees and indicate that if $r_j$ receives $c_k$ from $s_i$, the
multicast tree of $c_k$ from $s_i$ must contain the shortest path from
$s_i$ to $r_j$. Constraint (5) is a bandwidth constraint on each link.
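
To make the formulation concrete, the following sketch solves a tiny
instance by exhaustive enumeration rather than with an ILP solver. This is
an assumption made for illustration only: it is not the method used in the
paper and it scales exponentially. The function name, the data layout and
the toy instance are hypothetical. Constraint (2) holds by construction,
and the tree of each group is taken to be the union of the shortest paths
to its receivers, which is the smallest set of links allowed by
constraints (3) and (4).

```python
from itertools import product

def solve_static(paths, pref, bw, cap):
    """paths[(s, r)]: set of links on the shortest path from s to r
    pref[(r, c)]  : preference value of receiver r for content c
    bw[c]         : transmission rate of content c
    cap[e]        : capacity of link e
    Returns (best objective value, {(r, c): chosen server or None})."""
    servers = sorted({s for s, _ in paths})
    pairs = sorted(pref)  # every (receiver, content) pair
    best_value, best_assign = -1, None
    # Constraint (2): each pair gets at most one server (None = not received).
    for choice in product([None] + servers, repeat=len(pairs)):
        assign = dict(zip(pairs, choice))
        # Constraints (3), (4): the tree of group (s, c) contains every
        # shortest path to a receiver that gets c from s.
        tree = {}
        for (r, c), s in assign.items():
            if s is not None:
                tree.setdefault((s, c), set()).update(paths[(s, r)])
        # Constraint (5): bandwidth on each link.
        load = {}
        for (s, c), links in tree.items():
            for e in links:
                load[e] = load.get(e, 0) + bw[c]
        if any(load[e] > cap[e] for e in load):
            continue
        # Objective (1): sum of satisfied preference values.
        value = sum(pref[rc] for rc, s in assign.items() if s is not None)
        if value > best_value:
            best_value, best_assign = value, assign
    return best_value, best_assign

# Hypothetical instance: two servers, two receivers, two contents.
paths = {("s1", "r1"): {"e1", "e2"}, ("s1", "r2"): {"e1", "e3"},
         ("s2", "r1"): {"e4", "e2"}, ("s2", "r2"): {"e4", "e3"}}
pref = {("r1", "c1"): 25, ("r1", "c2"): 9, ("r2", "c1"): 16, ("r2", "c2"): 4}
bw = {"c1": 1, "c2": 1}
cap = {"e1": 1, "e2": 2, "e3": 2, "e4": 1}
value, assign = solve_static(paths, pref, bw, cap)
```

In this instance each server's outgoing link can carry only one stream, so
all preferences can be satisfied only if the two contents are served by
different servers. With a single server, the lower-valued content has to be
dropped, which matches the assumption that streams with lower preference
values may not be received under bandwidth shortage.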






4 Dynamic Server Selection Problem 

Due to the dynamic behavior of receivers in multicast communication, the
membership of a group does not stay fixed throughout a session. Therefore,
fast re-optimization for each join/leave request of a receiver is
desirable.

Let $rcv'[s_i, r_j, c_k]$ denote the fact that receiver $r_j$ is currently
a member of the group $(s_i, c_k)$ (its value is one if $r_j$ is a member
of the group, zero otherwise). The join/leave behavior of a receiver is
described as follows.

- join: a receiver $r_q$ with $\sum_i rcv'[s_i, r_q, c_k] = 0$ wants to
  join one of the groups $(s_1, c_k)$, $(s_2, c_k)$, ..., $(s_m, c_k)$.

- leave: a receiver $r_q$ with $rcv'[s_i, r_q, c_k] = 1$ wants to leave
  the group $(s_i, c_k)$.

We limit the number of receivers who are forced to switch their servers,
in order to prevent one receiver’s join/leave behavior from affecting all
the receivers spread over wide-area networks. We also limit the number of
possible alternative servers (servers that may be switched to) when a
receiver switches its server for a content, so that the receiver does not
select servers far from him/her. Under these restrictions, we formulate
the dynamic server selection problem.







Fig. 2. Grafting Distance 



[join] We define the grafting distance as the number of links on the
shortest path from a server to a receiver through which the content $c_k$
from that server has not been delivered yet (i.e., the number of links on
which delivery of the content would start when the receiver joins the
group). An example is shown in Fig. 2. We adopt the following policy.

- Receiver $r_q$ selects one of the two servers (say $s_{i_1}$ and
  $s_{i_2}$) with the two shortest grafting distances.

- For each $path(s_{i'}, r_p)$ that shares some links with
  $path(s_{i_1}, r_q)$ or $path(s_{i_2}, r_q)$, if receiver $r_p$ is
  receiving content $c_k$ from server $s_{i'}$, then $r_p$ is one of the
  receivers who may switch servers. As an alternative server from which to
  receive $c_k$, $r_p$ may select one of the two servers with the two
  shortest grafting distances.

Intuitively, each receiver whose streams may compete with $r_q$’s new
stream may have to switch servers. Furthermore, we limit the number of
possible alternative servers to only two for each pair of a receiver and a
content.
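
A minimal sketch of the grafting distance and of the choice of the two
candidate servers follows. The data layout and the names are hypothetical,
and breaking ties by server name is an assumption of this sketch; the
paper does not state a tie-breaking rule.

```python
def grafting_distance(path, tree_links):
    """Number of links on `path` that do not yet carry the content,
    i.e. links that would have to be grafted onto the multicast tree."""
    return sum(1 for link in path if link not in tree_links)

def two_closest_servers(paths_to_receiver, trees):
    """paths_to_receiver[s]: links on the shortest path from server s to
                             the joining receiver
    trees[s]              : links already used by group (s, c_k)
    Returns the (at most) two servers with the shortest grafting distances."""
    ranked = sorted(
        paths_to_receiver,
        key=lambda s: (grafting_distance(paths_to_receiver[s],
                                         trees.get(s, set())), s),
    )
    return ranked[:2]

# Hypothetical example: link names are plain strings.
paths_to_rq = {"s1": ["a", "b", "c"], "s2": ["d", "e"],
               "s3": ["f", "g", "h", "i"]}
trees = {"s1": {"a", "b"}, "s3": {"f"}}
candidates = two_closest_servers(paths_to_rq, trees)
```

Here s1 has the longest hop count after s3 but the shortest grafting
distance, because two of its three links already carry the content.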

The dynamic server selection problem for a receiver’s join request is an
ILP problem with the same objective function (1), the same constraints
(2)-(5) as in Section 3, and the following constraint, which fixes the
status of receivers who should not switch servers:

\[ rcv[s_i, r_j, c_k] = 1, \quad \forall (i, j, k) \ne (i', j', k') \tag{6} \]

where $(i', j', k')$ is a tuple satisfying the following condition. Note
that all the paths are given and the values $rcv'[s_i, r_j, c_k]$ in the
condition have already been decided. Therefore such tuples $(i', j', k')$
are uniquely determined.

\[ \bigl( path(s_{i_1}, r_q) \cap path(s_{i'}, r_{j'}) \ne \emptyset
   \;\lor\; path(s_{i_2}, r_q) \cap path(s_{i'}, r_{j'}) \ne \emptyset \bigr)
   \;\land\; rcv'[s_{i'}, r_{j'}, c_{k'}] = 1 \]
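
The condition on the tuples can be sketched as a set computation. The data
layout and the function names are hypothetical. The sketch also reads
constraint (6) as applying to tuples that are currently receiving and are
not switching candidates; this reading is an assumption made here, not a
statement taken from the formulation.

```python
def switch_candidates(paths, current, new_paths):
    """paths[(s, r)]: set of links on the shortest path from s to r
    current       : set of (s, r, c) with rcv'[s, r, c] = 1
    new_paths     : the link sets of path(s_i1, r_q) and path(s_i2, r_q)
    Returns the tuples whose path shares a link with either new path."""
    return {(s, r, c) for (s, r, c) in current
            if any(paths[(s, r)] & p for p in new_paths)}

def fixed_tuples(current, candidates):
    """Currently received tuples that keep their server (assumed reading
    of constraint (6))."""
    return current - candidates

# Hypothetical example with string link names.
paths = {("s1", "r1"): {"e1", "e2"}, ("s2", "r2"): {"e5"},
         ("s1", "r3"): {"e1", "e6"}}
current = {("s1", "r1", "c1"), ("s2", "r2", "c1"), ("s1", "r3", "c2")}
new_paths = [{"e1", "e7"}, {"e8"}]
cands = switch_candidates(paths, current, new_paths)
fixed = fixed_tuples(current, cands)
```

Only the variables of the candidate tuples remain free in the ILP, which
is what keeps the dynamic problem small.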



[leave] We adopt the following switching policy when $r_q$ leaves group
$(s_i, c_k)$.

- For each $path(s_{i'}, r_{j'})$ that shares some links with
  $path(s_i, r_q)$, $r_{j'}$ can select $s_{i'}$ as the server from which
  to receive a content $c_k$ that $r_{j'}$ has not received.

The dynamic server selection problem for a receiver’s leave request is an
ILP problem with the same objective function (1), the same constraints
(2)-(5), and the following constraint, which fixes the status of receivers
who should continue to receive streams.

\[ rcv[s_i, r_j, c_k] = 1, \quad \text{if } rcv'[s_i, r_j, c_k] = 1 \tag{7} \]

In Section 5 we measure the performance of dynamic server selection for a
receiver’s join request compared with the static server selection, in
terms of the computation time, the total sum of satisfied preference
values and the total number of server switchings.

5 Simulation 

We have used two types of networks, based on (a) the Tiers model [3]
(Fig. 3) and (b) the Random model (Fig. 4). Tiers is a hierarchical model
organized into three kinds of domains: LAN, MAN and WAN. For the Tiers
model, we randomly decided the number of nodes contained in the LAN, MAN
and WAN domains, and set the link capacities of LAN, MAN and WAN in the
ratio 1 : 10 : 100. We then simulated 5 times on these networks while
varying the number of nodes $|N|$. For the Random model, we randomly
decided each link capacity based on a Gaussian distribution, and simulated
20 times on the networks while varying $|N|$. In both cases the networks
had $|S| = 0.1|N|$ servers, $|R| = 0.5|N|$ receivers and $|C| = 5$
contents, and receivers’ preference values were selected randomly from 25,
16, 9, 4 and 1.

Fig. 3. Network Topology (a) Tiers Model

Fig. 4. Network Topology (b) Random Model

We simulated the dynamic server selection for a receiver $r_{|R|}$ who
tried to join a group in the situation where the server selection had
already been optimized for the receivers $r_1, \ldots, r_{|R|-1}$ by the
static server selection (this simulation is denoted by (d)), and the
static server selection for the receivers $r_1, \ldots, r_{|R|}$ (denoted
by (s)). We then measured the calculation time, the total sum of satisfied
preference values and the number of server switchings for (d) and (s). The
results are shown in Section 5.1. In addition, to examine the validity of
limiting the number of alternative servers to 2, we measured the same
values for the dynamic server selection with different numbers of
alternative servers $|AS|$: (d-1) $|AS| = 0.25|S|$, (d-2) $|AS| = 0.5|S|$
and (d-3) $|AS| = 0.75|S|$. The results are shown in Section 5.2.
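
The scenario parameters described above can be sketched as follows. The
function name, the use of integer node identifiers, the rounding of
$0.1|N|$ and $0.5|N|$, and the choice to give every receiver a preference
for every content are assumptions of this sketch; the topology generation
itself (Tiers or Random) is not reproduced.

```python
import random

def make_scenario(num_nodes, seed=0):
    """Pick servers, receivers, contents and preference values for a
    network with `num_nodes` nodes, following |S| = 0.1|N|, |R| = 0.5|N|
    and |C| = 5. Rounding to the nearest integer is an assumption."""
    rng = random.Random(seed)
    nodes = list(range(num_nodes))
    num_servers = max(1, round(0.1 * num_nodes))
    num_receivers = round(0.5 * num_nodes)
    # Servers and receivers are placed on distinct nodes (assumption).
    chosen = rng.sample(nodes, num_servers + num_receivers)
    servers = chosen[:num_servers]
    receivers = chosen[num_servers:]
    contents = list(range(5))
    pref = {(r, c): rng.choice([25, 16, 9, 4, 1])
            for r in receivers for c in contents}
    return servers, receivers, contents, pref

servers, receivers, contents, pref = make_scenario(100)
```

For 100 nodes this gives 10 servers, 50 receivers and 5 contents, the
configuration quoted for the timing measurement in Section 2.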



5.1 Comparison of Static and Dynamic Server Selection 




Fig. 5. Number of Nodes vs. Calculation Time: (a) Tiers Model

Fig. 6. Number of Nodes vs. Calculation Time: (b) Random Model

Calculation Time. We show the calculation time of the static server
selection and the dynamic server selection. We varied $|N|$ (the number of
nodes) from 28 to 107 in steps of 3 nodes on the Tiers model (Fig. 5) and
from 10 to 30 in steps of 2 nodes on the Random model (Fig. 6). In these
graphs, the average times of (s) the static and (d) the dynamic server
selection are connected by dashed and solid lines, respectively. For each
number of nodes, the range between the worst time and the best time is
also shown by a vertical line.

We find an exponential increase of the calculation time in the static
server selection at around 90 nodes or more on the Tiers model and at
around 28 nodes or more on the Random model. In contrast, the calculation
time of the dynamic server selection increases linearly. From these
results, we can say that the proposed dynamic server selection can solve
the problem within a reasonable time. On the Tiers model in particular,
the calculation time is just 10 seconds for 70 nodes in the worst case. On
the Random model, the static server selection took more calculation time.
This is due to the distribution of distances between servers and
receivers. On the Tiers model, since the distances to the servers differ
greatly from each other, the number of candidate servers may be reduced
during the calculation. On the Random model, on the other hand, the
distances are very close to each other.

The Total Sum of Preference Values. We have measured the total sum of
preference values of the dynamic and static server selections. The ratios
of the former to the latter on the Tiers model and the Random model are
shown in Fig. 7 and Fig. 8, respectively. On the Tiers model, even in the
worst case the ratio is 89%, and the average ratio is 95%. These values
are good enough considering the tradeoff between the calculation time and
the optimality of satisfied preferences. On the Random model, there are a
few worst cases in which the ratio falls in the range of 50%-60%. This is
because the alternative servers of many receivers converge on a few
servers. However, the average ratio is kept above 95%; therefore our
dynamic server selection keeps high optimality compared with the static
server selection on both models.



Fig. 7. Number of Nodes vs. Total Sum of Preference Values: (a) Tiers Model

Fig. 8. Number of Nodes vs. Total Sum of Preference Values: (b) Random Model



The Number of Server Switchings. We define a new variable
$switch[s_i, r_j, c_k]$ for each combination of $s_i$, $r_j$ and $c_k$
with $rcv'[s_i, r_j, c_k] = 1$ as follows.

\[ switch[s_i, r_j, c_k] = 1 - rcv[s_i, r_j, c_k] \tag{8} \]

$switch[s_i, r_j, c_k]$ represents the fact that $r_j$ stops receiving
$c_k$ from $s_i$. Thus we can represent the number of server switchings as
follows.

\[ \sum_{i} \sum_{j} \sum_{k} switch[s_i, r_j, c_k] \tag{9} \]

Fig. 9. Number of Nodes vs. Number of Server Switchings: (a) Tiers Model

Fig. 10. Number of Nodes vs. Number of Server Switchings: (b) Random Model

We have measured the number of server switchings. We show the results on
the Tiers model (Fig. 9) and on the Random model (Fig. 10).
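
The count defined by (8) and (9) can be sketched directly on sets of
tuples. The function name and the set-based representation of $rcv'$ and
$rcv$ are assumptions of this sketch.

```python
def count_switchings(current, new):
    """current: set of (s, r, c) with rcv'[s, r, c] = 1 before re-optimization
    new    : set of (s, r, c) with rcv[s, r, c] = 1 after re-optimization
    A tuple counts as one switching when it was received before and is
    no longer received from the same server, as in (8)."""
    return sum(1 for t in current if t not in new)

# Hypothetical example: r1 moves from s1 to s2 for content c1,
# and a new receiver r3 joins.
current = {("s1", "r1", "c1"), ("s1", "r2", "c1"), ("s2", "r1", "c2")}
new = {("s2", "r1", "c1"), ("s1", "r2", "c1"), ("s2", "r1", "c2"),
       ("s2", "r3", "c1")}
switchings = count_switchings(current, new)
```

Note that under definition (8) a receiver that stops receiving a content
altogether is counted in the same way as one that moves to another server.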

In the static server selection, server switchings occurred 60 times on
average on the Tiers model with 100 nodes. Since we set $|R| = 0.5|N|$,
i.e. 50 receivers, this corresponds to at least one server switching per
receiver on average (60/50 = 1.2). This is too much overhead considering
multicast join/leave latencies. In contrast, the maximum number of server
switchings is greatly reduced in the dynamic server selection.

5.2 Effect of Number of Alternative Servers 

In order to examine the effect of the number of alternative (selectable)
servers on the calculation time, the total sum of satisfied preference
values and the number of server switchings, we have measured these values
on the Tiers model for the dynamic server selection with different numbers
of alternative servers $|AS|$: (d-1) $|AS| = 0.25|S|$, (d-2)
$|AS| = 0.5|S|$ and (d-3) $|AS| = 0.75|S|$.

Calculation Time. We show the average calculation time in Fig. 11.
Compared with (d), where $|AS| = 2$, the calculation time in (d-3) tends
to diverge, though not as much as in (s). Therefore we can say that our
policy of limiting the number of alternative servers is adequate.

The Total Sum of Preference Values. We have measured the average and the
worst values of the total sum of preference values, and show the ratios of
(d), (d-1), (d-2) and (d-3) to (s) in Fig. 12 (average case) and Fig. 13
(worst case), respectively. The ratios of (d-1), (d-2) and (d-3) are
greater than that of (d) and are kept above 95% in the worst case. In the
average case, however, (d) achieved almost the same values as (d-1), (d-2)
and (d-3).

Fig. 11. Number of Nodes vs. Calculation Time on Tiers Model

Fig. 14. Number of Nodes vs. Number of Server Switchings on Tiers Model

The Number of Server Switchings. We show the number of server switchings
in Fig. 14. The behavior of (d-1), (d-2) and (d-3) is similar to that of
(s), while the number is kept low in (d). From the results above, we can
say that our policy of limiting the number of alternative servers is
adequate.

6 Conclusion

In this paper, we have proposed static and dynamic replicated-server
selection techniques for multiple multicast streams. In the proposed
static server selection technique, given the locations of servers and
receivers, the shortest path between each pair of a server and a receiver
on the network, and each receiver’s preference value for each content, the
optimal combinations of servers and contents for each receiver are decided
so that the total sum of the preference values of the receivers is
maximized. We use integer linear programming (ILP) to make this decision.
In our dynamic server selection technique, the combinations of servers and
contents for each receiver are decided so that the number of server
switchings is reduced and the total sum of the preference values is kept
high, by restricting both the receivers who may suffer server switching
and the alternative servers they may switch to. These restrictions also
speed up the calculation of the ILP problems. Through simulations, we have
confirmed that, compared with the static server selection, our dynamic
server selection technique needs less than 10 % of the calculation time,
retains more than 90 % of the sum of preference values and causes less
than 5 % of the server switchings.

As future work, we plan to design and implement an architecture, based on
the proposed method, that lets receivers select optimal servers in the
presence of replicated multicast servers. We consider that our technique
can be incorporated into application-layer anycast [5]. Application-layer
anycast is an implementation of anycast at the application level, and is
organized around ADNs (Anycast Domain Names). An ADN server provides a
location service and replies to a client’s request with the list of
servers that can provide the requested service. The ADN server can select
those servers based on certain metrics. Therefore, if the ADN server knows
the network information needed for our server selection technique, the
optimal allocation of servers can be calculated on the ADN server.
However, we have to consider the following two problems. The first is that
ADN servers reply only to the receivers who requested services. Therefore,
we need an additional mechanism to let the other receivers switch their
servers. The second is how to collect the information. We are now
investigating an efficient way to meet these requirements.

Moreover, in order to show the feasibility of the receiver/server limitation 
policy adopted in our dynamic server selection, we will try to analyze the upper 
bound of the solution (that is, the optimality of the solution of dynamic server 
selection) compared with the static server selection on Tiers model. 


Abstract. The Internet explosion is driving the need for distributed
multimedia presentations (DMPs), which provide multiple users with
QoS-controlled multimedia services such as media distribution and virtual
classrooms. DMPs are usually provided over multicast communication.
However, jitter over the best-effort Internet disturbs the orchestration
of multimedia presentations. Furthermore, combining multiple media streams
with multicast delivery complicates network and system deployment. In this
paper, we describe the major considerations and techniques in designing
multiple-stream multimedia presentations in a multicast communication
environment. Based on the proposed control schemes, we develop a
communication engine named Mcast. Mcast (i) provides a flexible authoring
tool that allows users to author a multiple-stream multimedia presentation
in a multicast environment and (ii) achieves smooth multimedia
presentations with a well-designed temporal control mechanism.



1 Introduction 

With the advances in computer and communication technologies, distributed
multimedia presentations (DMPs), e.g., video distribution and distance
learning, are becoming more and more popular. DMPs are characterized by
the integrated multicast communication and presentation of multiple
continuous and static media streams [?]. Multicast communication
simultaneously transmits media to multiple recipients, all of whom share
the same multicast address [?]. A continuous medium, such as video or
audio, is a time-dependent medium that possesses temporal relations
between media units; a static medium, such as text or a still image, is a
time-independent medium that has no temporal relation between media units
[?]. The temporal relations of a multimedia presentation can be
pre-defined and scheduled in a pre-orchestrated multimedia presentation.
The goal of a DMP is to present all composed media streams according to
the temporal presentation schedule.

H. Scholten and M. van Sinderen (Eds.): IDMS 2000, LNCS 1905, pp. 53-^^ 2000.
(c) Springer-Verlag Berlin Heidelberg 2000

C.-M. Huang and H.-Y. Kung

Clients of a multicast group, which are scattered over separate LANs, also
possess diverse capabilities for processing multiple media streams and
have distinct available bandwidth on the paths leading to them. Multiple
streams mean that media streams are retrieved from media bases and are
transmitted via different communication channels with diverse QoS
(Quality-of-Service) requirements [?]. When multiple streams are
transmitted, continuous media are usually carried over UDP, because they
tolerate media loss and have real-time requirements; static media are
usually carried over TCP or RMTP (reliable multicast transport protocol)
because of their reliability requirement [5]. Heterogeneous communication
protocols induce serious temporal anomalies between media streams. That
is, different network devices and links, which have diverse communication
capabilities and network traffic situations, disturb the pre-defined
temporal relations and severely degrade the degree of QoS satisfaction of
the clients in a heterogeneous multicast environment [?]. To compensate
for temporal anomalies, multimedia synchronization is essential for
achieving smooth coordination and cooperation among various media.

Multimedia synchronization is used to coordinate, schedule, and present
media units in the distributed environment [5]. The two types of temporal
synchronization are intra-medium synchronization and inter-media
synchronization [?]. Intra-medium synchronization ensures the intra-medium
temporal relations of a medium stream and compensates for jitter, which is
the asynchronous anomaly between consecutive media units of a medium
stream [9]. Inter-media synchronization ensures the inter-media temporal
relations among related media streams and compensates for skew, which is
the time difference between related media streams [2]. With temporal
synchronization mechanisms, DMPs achieve smooth presentations.
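
The two anomalies can be quantified in a simple way, sketched below. The
formulas are one common way of measuring jitter and skew and are an
assumption of this sketch, not the definitions used by Mcast; the function
names and the millisecond units are likewise hypothetical.

```python
def jitter(arrivals, period):
    """Deviation of each inter-arrival gap from the nominal period.
    `arrivals` are the arrival times (ms) of consecutive media units of
    one stream; `period` is the nominal spacing (ms)."""
    return [(b - a) - period for a, b in zip(arrivals, arrivals[1:])]

def skew(time_a, time_b):
    """Time difference (ms) between two related media units that should
    be presented together, e.g. a video frame and its audio block."""
    return time_a - time_b

# Hypothetical 25 frames/s video stream: nominal spacing 40 ms.
video_jitter = jitter([0, 40, 85, 120], 40)
av_skew = skew(1000, 960)
```

Intra-medium synchronization aims to keep the first quantity near zero for
each stream, and inter-media synchronization aims to keep the second near
zero between related streams.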

A well-designed DMP essentially consists of two components: the authoring
system and the temporal control system [2]. The authoring system is the
mechanism that generates behavior specifications. Behavior specifications
represent (i) the media attributes of related streams, which include the
media streams involved and the temporal and spatial attributes of each
medium stream, and (ii) the temporal relationships between related media
streams. The authoring system allows a user to specify the behavior
specifications of the corresponding multimedia presentation. The temporal
control system comprises the synchronization and presentation mechanisms
that achieve the temporal relations specified in the corresponding
behavior specifications. In order to specify the behavior of multiple
streams accurately and to achieve perceptually acceptable multimedia
synchronization/presentation, a communication engine named Mcast is
proposed and developed in this paper.

Mcast provides an authoring tool to specify the media behavior
specifications and the required multicast environment. The temporal
control system of Mcast is composed of (i) the synchronization mechanism,
which achieves multiple-stream multimedia synchronization based on the
proposed synchronization schemes, and (ii) the presentation mechanism,
which achieves a perceptually acceptable presentation based on the
proposed presentation schemes. System developers can use Mcast to develop
DMPs over a multicast communication environment.

The rest of the paper is organized as follows. Section 2 describes the net- 
work and system architecture of Mcast. Section 3 describes the proposed control 
schemes for resolving multiple-stream multimedia synchronization/presentation. 
Section 4 evaluates the performance of Mcast. Finally, Section 5 concludes this 
paper. 

2 Network and System Architecture of Mcast 

The proposed multicast multimedia network, called the Multicast Multimedia
Communication Network (M³CN), is a two-level hierarchical architecture
that spans a distributed environment. The M³CN consists of a WAN and many
LANs attached to the WAN. Each LAN is composed of a local Multicast
MultiMedia Server (M³ server) and clients. An M³ server transmits media
units to hosts via the LAN and/or the WAN. Clients of a presentation
group, which are scattered over different LANs, present the same
multimedia resource simultaneously.

In M³CN, the concept of a “virtual server” is adopted. A virtual server
receives media units from the “physical server”, which owns the
presentation resource, and re-transmits them to end clients. The virtual
server is the local server of a LAN and compensates for WAN anomalies by
pre-depositing some media units and applying suitable synchronization
schemes. Virtual servers reduce the overhead of synchronization control in
clients, because WAN anomalies are compensated for and media streams are
synchronized at the virtual servers. Clients can therefore be simpler,
low-end devices, e.g. a set-top box, a diskless networked PC, or a
networked TV.

Figure 1 depicts the network and system architecture of Mcast. The
underlying network communication is based on MBone [3], which provides
multicast transmission across the Internet. The authoring tool is the
authoring system. The temporal control mechanism is composed of the
presentation information file, the code generator, and the multicast
presentation system. An author uses the authoring tool to specify the
temporal and spatial attributes of media and to author his/her multimedia
presentation. Temporal attributes can denote thirteen temporal
relationships [?]. In order to represent the temporal relationships, we
propose a temporal definition to specify the temporal attributes of a
multimedia presentation in Section 4. After the media attributes have been
specified, the authoring tool generates the presentation information file
accordingly. The presentation information file contains a set of data
structures that record the spatial and temporal attributes of the
corresponding multimedia presentation. Based on the presentation
information file, the code generator generates the corresponding C code
that makes up the multicast presentation system.

Fig. 1. The abstract network and system architecture of Mcast. 

The multicast presentation system is composed of the physical server system 
(PSS), the virtual server system (VSS), and the client system (CS). The main 
functions of the PSS are (i) to store media resources that are requested by the 
virtual servers, (ii) to specify the presentation schedule that contains the 
temporal and spatial relations of the related media streams, and (iii) to multicast 
requested media to the virtual servers. The PSS is composed of three system 
layers. The multimedia authoring layer provides the authoring tool, which allows 
users to author the multiple-stream multimedia presentation and to generate the 
corresponding presentation schedule. A system manager can specify the 
communication configuration, which contains the multicast group address and the 
communication ports. The rate control layer is responsible for retrieving media 
from the media bases and transmitting these media. It is composed of two kinds of 
components: Actor and Synchronizer components. An Actor controls 
a medium stream: it retrieves media units and multicasts them 
to the corresponding Actors of the virtual servers. The Synchronizer 
coordinates the rate control among Actors. The communication layer provides (i) 
UDP multicast, which offers more efficient transmission and is used to 
transmit continuous media, and (ii) RMTP (Reliable Multicast Transport Protocol) 
multicast, which provides reliable multicasting and is used to transmit static 
media and the presentation schedule. 

The VSS is a proxy. After receiving media units, the VSS stores them in 
the media buffer temporarily. According to the presentation schedule, the VSS 
multicasts media units to the designated group members using the proposed 
synchronization control, which compensates for WAN anomalies. The VSS 
is composed of two system layers. The streaming control layer is responsible 
for multiple-stream synchronization and streaming transmission. The streaming 
transmission achieves continuous and steady multicast transmission. An Actor 
receives and transmits the media units of a medium stream and controls the 
medium flow. That is, an Actor controls intra-medium synchronization. The 
Synchronizer achieves the inter-media synchronization to compensate for skew 
anomalies and to coordinate the Actors' behaviors. The communication layer 
transmits the continuous and static media via the UDP and RMTP connections, 
respectively. 

The main function of the CS is to achieve a smooth multimedia presentation 
using the proposed synchronization and presentation controls. The CS is composed 
of three system layers. The presentation layer is responsible for presenting a 
multiple-stream multimedia presentation. The orchestration layer compensates 
for jitter and skew anomalies. The Actor achieves the intra-medium 
synchronization and the presentation control. The Synchronizer achieves the 
inter-media synchronization among Actors. The communication layer is responsible 
for receiving media from the corresponding connection channels. 

In brief, the steps of preparing a multicast multiple-stream multimedia 
presentation using Mcast are as follows. (i) A user constructs the desired 
presentation schedule using the authoring tool. (ii) The code generator generates C 
code for part of the synchronization/presentation control according to the 
presentation schedule. (iii) The system manager compiles the generated C code with 
the kernel of Mcast to produce the complete executable code for the desired 
multimedia presentation. (iv) The complete executable code contains the 
physical server code, the virtual server code, and the client code. 

3 Resolutions of Multiple-Stream Multimedia 
Synchronization and Presentation 

A multiple-stream multimedia presentation always contains several kinds of 
media, such as video, audio, text, and image. Each medium stream has its own 
presentation schedule and may have temporal relations with other media 
streams. In order to have a consistent presentation, clients of a multicast group 
have to present media streams according to the presentation schedules as much 
as possible. However, the diversity and heterogeneity of multicast environments 
inevitably disturb the temporal relations of media streams, i.e., disturb the 
presentation schedules of media streams. Therefore, clients synchronize media 
streams with suitable synchronization schemes. The issues in achieving 
multiple-stream multimedia synchronization at the presentation layer are as 
follows. (i) Designers have to specify the related synchronization points 
between/among multiple streams. From these synchronization points, accurate 
presentation schedules that define the temporal relations of multiple streams are 
derived. (ii) Based on the presentation schedules, designers adopt suitable and 
practical synchronization/presentation schemes to achieve a perception DMP as 
much as possible. 

In this section, we specify the temporal structure of a multiple-stream 
multimedia presentation in order to derive adequate synchronization points. Then, 
we describe the proposed synchronization and presentation schemes, which are 
suitable for diverse media streams. 



Fig. 2. The presentation time bar chart of an illustrative multimedia presentation. 



3.1 Types of Synchronization Points 

Figure 2 depicts a presentation example with a time-bar chart. In the 
presentation example, there are four kinds of media streams: video, audio, 
text, and image. Each medium stream is either presented or idle during a given 
time period and may be related to other streams. In order to identify the 
temporal relations among multiple media streams, we propose three kinds of temporal 
synchronization points: stages, sections, and segments. 

1. Stage. A presentation stage is a semantic cut of a multimedia presentation. 
For example, let the multimedia presentation be a CNN news broadcast about 
the chess match between world chess champion Garry Kasparov and the 
supercomputer Deep Blue. Figure 2 depicts the presentation as follows. (1) The news 
reporter reports the news about a chess match between Garry Kasparov and 
Deep Blue. The news reporter's audio, Garry Kasparov's video, and the 
related news texts and images are presented. (2) Garry Kasparov thinks and 
moves a piece. Then, the video of the chess explanation, the texts, and the 
images introducing Garry Kasparov are presented. (3) An agent 
moves the piece according to Deep Blue's decision. The background 
music and some auxiliary texts and images are always presented. Thus, the 
presentation depicted in Figure 2 is divided into three stages. At the 
commencement of each stage, inter-media synchronization among related media 
streams is required to achieve a consistent presentation. 

2. Section. A presentation section indicates that some media objects have 
temporal relations. Steinmetz specified the temporal relations with thirteen 
different relations, which are the equal, start, before, meet, during, overlap, and 
finish relations, and their reversed relations. Based on these possible 
temporal relations, a presentation stage can be divided into several 
presentation sections. One medium object's presentation in a section depends 
on another medium object's presentation status, and a cut point between 
two presentation sections is a synchronization point. For example, the text 
of the news and the video of Garry Kasparov appear when a specific audio is 
presented. As depicted in Figure 2, the presentation of the audio object A1 
starts the presentations of V1 and T1 at synchronization point $t_1$, i.e., at the 
end of section 1 and the commencement of section 2. The object V1 finishes and the 
objects T2 and I2 start at synchronization point $t_3$, i.e., at the end of 
section 2 and the commencement of section 3. At the commencement of each section, 
an inter-media synchronization control is required to achieve a consistent 
presentation. 

3. Segment. In a presentation section, it is possible that a medium stream 
has no medium object presented during some time periods. A presentation 
period is denoted as an active segment; an idle period is denoted as an idle 
segment. In Figure 2, the text medium has two segments in section 2. The 
first segment displays the text object T1 and the idle segment lasts $d_1$ time 
units. With the help of presentation segments, the presentation section can 
be resolved. 

Based on the concept of stages, sections, and segments, we (i) clearly specify the 
temporal relations among multiple streams and (ii) develop the authoring tool 
and the temporal control system of Mcast. 
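
The stage/section/segment hierarchy can be pictured as a small data model. The following Python sketch is only an illustration under our own assumptions: the class names, the `duration` fields, and the example durations are hypothetical and are not taken from the presentation information file that Mcast actually generates.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Segment:
    """One period of a medium stream inside a section."""
    duration: float                      # length in time units (hypothetical)
    media_object: Optional[str] = None   # None marks an idle segment

    @property
    def is_active(self) -> bool:
        return self.media_object is not None


@dataclass
class Section:
    """Span between two synchronization points: stream name -> segments."""
    streams: Dict[str, List[Segment]] = field(default_factory=dict)

    def duration(self) -> float:
        # Assumption: a section lasts as long as its longest stream.
        return max(
            (sum(seg.duration for seg in segs) for segs in self.streams.values()),
            default=0.0,
        )


@dataclass
class Stage:
    """A semantic cut of the presentation, made of consecutive sections."""
    sections: List[Section] = field(default_factory=list)

    def sync_points(self, start: float = 0.0) -> List[float]:
        """Start time of each section, where inter-media sync is required."""
        points, t = [], start
        for section in self.sections:
            points.append(t)
            t += section.duration()
        return points


# Text stream of a section: active segment T1 followed by an idle segment.
section_2 = Section({
    "video": [Segment(6.0, "V1")],
    "text": [Segment(4.0, "T1"), Segment(2.0)],
})
stage = Stage([Section({"audio": [Segment(3.0, "A1")]}), section_2])
```

Calling `stage.sync_points()` on this example yields the start times of the two sections, which is where the inter-media synchronization control would be applied.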

3.2 Synchronization Control Scheme 

Based on the concept of different synchronization point types, we propose the 
stage-master-based synchronization scheme to solve the multiple-stream 
multimedia synchronization problem. The stage-master-based synchronization is a 
refinement of the master-based scheme, which is adopted by Yang et al. 
Yang et al. demonstrate that the audio stream is always the master stream since 
humans are more sensitive to variations in audio. However, in some presentation 
examples, the audio stream is not available during the whole presentation, 
e.g., in a piece of silent news. During this silent period, the inter-media 
synchronization control cannot be achieved because the master stream is absent. 

In order to solve the problem of the absent master stream, each presentation 
stage is associated with a master stream to coordinate the presentation and the 
master stream is changeable between stages. Based on the stage-master-based 
synchronization scheme, the master stream dominates the commencement and 
finish of media presentations within the presentation stage. (1) If a slave stream 
finishes its presentation earlier than the master stream at a synchronization 
point, the slave stream has to block or extend its presentation until the master 
stream finishes its presentation. (2) When the master stream finishes its presen- 
tation at a synchronization point, the late slave streams have to discard media 
units to keep pace with the master stream. (3) The master stream is changeable 
from one stage to the next. 

Fig. 3. An example of (a) the time-oriented presentation control, and (b) the 
content-oriented presentation control for a video stream. 



In Mcast, the criterion for selecting a master medium in a stage is based on 
human sensitivity to the media. The general principle is as follows. (1) If 
audio exists in a stage, an audio stream is the master; (2) if audio is absent in a 
stage, a video stream becomes the master; (3) if there is no continuous medium 
stream, one of the static media streams is selected to be the master. In Figure 
2, the audio stream A1, the video stream V2, and the audio stream A2 are the 
masters of stages 1, 2, and 3, respectively. 
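
The master selection principle and the rules imposed on slave streams at a synchronization point can be sketched as follows. This is a minimal Python illustration, not the generated C code of Mcast: the function names, the `SlaveAction` enumeration, and the choice of the first static stream as master in case (3) are assumptions made for the example.

```python
from enum import Enum
from typing import Dict, Optional


def select_master(stage_streams: Dict[str, str]) -> Optional[str]:
    """Pick the master stream of a stage.

    stage_streams maps a stream name to its medium type
    ("audio", "video", "text" or "image").
    """
    # Rules (1) and (2): audio first, then video.
    for medium in ("audio", "video"):
        for name, kind in stage_streams.items():
            if kind == medium:
                return name
    # Rule (3): no continuous medium, so one static stream is chosen.
    # Taking the first one is an arbitrary choice made for this sketch.
    return next(iter(stage_streams), None)


class SlaveAction(Enum):
    BLOCK_OR_EXTEND = "wait for the master to finish"
    DISCARD_UNITS = "drop late media units to catch up"
    IN_SYNC = "nothing to do"


def slave_action(slave_finish: float, master_finish: float) -> SlaveAction:
    """Decide what a slave stream does at a synchronization point."""
    if slave_finish < master_finish:
        return SlaveAction.BLOCK_OR_EXTEND   # slave is early
    if slave_finish > master_finish:
        return SlaveAction.DISCARD_UNITS     # slave is late
    return SlaveAction.IN_SYNC


# A stage with audio, a stage without audio, and a stage of static media only.
master_1 = select_master({"V1": "video", "A1": "audio", "T1": "text"})
master_2 = select_master({"T2": "text", "V2": "video"})
master_3 = select_master({"T3": "text", "I3": "image"})
```

Because the master is recomputed per stage, a stage without audio still has a stream that drives the synchronization points.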

3.3 Presentation Control Scheme 

In order to compensate for jitter anomalies, one can adopt the blocking scheme 
for the audio medium and the non-blocking scheme for the video medium to 
achieve intra-medium synchronization. Figure 3(a) depicts an illustrative example 
for the video stream. When medium unit $k$ was presented at time $t$, the next 
medium unit $k+1$ should be presented at time $t + \theta$, where $\theta$ is the 
presentation duration of a medium unit. Unfortunately, medium unit $k+1$ does not 
arrive on time. Hence, medium unit $k$ is re-presented at time $t + \theta$ 
according to the non-blocking scheme. While medium unit $k$ is being re-presented, 
medium units $k+1$ and $k+2$ arrive before time $t + 2\theta$. At time 
$t + 2\theta$, the “expected” medium unit is presented. Should the “expected” 
medium unit be unit $k+1$ or unit $k+2$? To answer this question, two presentation 
schemes are considered: (i) the time-oriented and (ii) the content-oriented schemes. 

If the main concern is (i) to satisfy time-related temporal relations and (ii) 
to keep the actual presentation time length equal to the nominal presentation 
time length as much as possible, the time-oriented presentation scheme can be 
adopted. The “expected” medium unit should be the one that is closest to the 
nominal one. In Figure 3(a), medium unit $k+2$ is presented at time $t + 2\theta$ 
and medium unit $k+1$ is discarded. The mathematical formula for obtaining the 
expected medium unit is as follows. At time $t + i\theta$, we assume that (1) the 
last presented medium unit is medium unit $k + j$, and (2) the received queue 
contains media units $k + r_1$, $k + r_2$, ..., and $k + r_n$. Let the expected 
medium unit be $k + m$ at time $t + i\theta$. Then $m$ is the maximum of the subset 
of $\{j, r_1, r_2, \ldots, r_n\}$ in which all of the elements are less than or 
equal to $i$, i.e., 

$$m = \max\{x \in A \mid x \le i\}, \qquad A = \{j, r_1, r_2, \ldots, r_n\}.$$

The time-oriented scheme is suitable for continuous media. The drawback of the 
time-oriented scheme is that there may be some flicker at the synchronization 
point when several delayed media units are discarded at the same time. 
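
The time-oriented selection rule can be written directly from the formula. The Python sketch below is a hedged illustration: the function name and the representation of units by their integer offsets from $k$ are our own choices, not part of Mcast.

```python
from typing import Iterable


def time_oriented_unit(i: int, j: int, queued: Iterable[int]) -> int:
    """Return m, the offset of the unit to present at time t + i*theta.

    i      : index of the current presentation slot
    j      : offset of the last presented unit (unit k + j)
    queued : offsets r_1..r_n of the units waiting in the received queue
    """
    if j > i:
        raise ValueError("the last presented unit cannot lie in the future")
    # A = {j, r_1, ..., r_n}; keep the elements that are not later than i
    # and take the largest one, i.e. the unit closest to the nominal one.
    candidates = [x for x in (j, *queued) if x <= i]
    return max(candidates)


# Walk-through of Figure 3(a): unit k+1 is late, then k+1 and k+2 arrive.
slot_1 = time_oriented_unit(i=1, j=0, queued=[])       # unit k is re-presented
slot_2 = time_oriented_unit(i=2, j=0, queued=[1, 2])   # unit k+2, k+1 is skipped
```

With an empty queue the rule falls back to the last presented unit, which corresponds to the non-blocking re-presentation of unit $k$.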

If the main concern is to keep the completeness of a medium presentation 
as much as possible, the content-oriented presentation scheme can be adopted. 
Each medium unit is presented as much as possible. In Figure 3(b), medium unit 
$k+1$ is presented at time $t + 2\theta$ and medium unit $k+2$ is presented at time 
$t + 3\theta$. The mathematical formula for obtaining the expected medium unit is as 
follows. At time $t + i\theta$, we assume that (1) the last presented medium unit is 
medium unit $k + j$, and (2) the received queue contains media units $k + r_1$, 
$k + r_2$, ..., and $k + r_n$. Let the expected medium unit be $k + m$ at time 
$t + i\theta$. Then $m$ is the minimum of the subset of 
$\{j, r_1, r_2, \ldots, r_n\}$ in which all of the elements are less than or 
equal to $i$, i.e., 

$$m = \min\{x \in A \mid x \le i\}, \qquad A = \{j, r_1, r_2, \ldots, r_n\}.$$

The content-oriented scheme is suitable for static media and content-critical 
continuous media. The drawbacks of the content-oriented scheme are twofold: 
(i) the total presentation time may become longer 
than the nominal presentation time, and (ii) more inter-media asynchronous 
anomalies may exist until an inter-media synchronization is achieved. 
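
The content-oriented rule can be sketched in the same way. One assumption is made explicit here: since the queued offsets are normally larger than $j$, a minimum taken literally over $A$ would typically return $j$ again, so this sketch re-presents unit $k + j$ only when no queued unit qualifies. That reading reproduces the walk-through of Figure 3(b), but it is our interpretation, and the function name is hypothetical.

```python
from typing import Iterable


def content_oriented_unit(i: int, j: int, queued: Iterable[int]) -> int:
    """Return m, the offset of the unit to present at time t + i*theta.

    Assumption of this sketch: the earliest queued unit that is newer than
    the last presented one (and not later than slot i) is shown next; the
    last presented unit k + j is repeated only if no such unit exists.
    """
    pending = [x for x in queued if j < x <= i]
    return min(pending) if pending else j


# Walk-through of Figure 3(b): no unit is skipped, the schedule slips instead.
slot_1 = content_oriented_unit(i=1, j=0, queued=[])       # unit k is re-presented
slot_2 = content_oriented_unit(i=2, j=0, queued=[1, 2])   # unit k+1
slot_3 = content_oriented_unit(i=3, j=1, queued=[2])      # unit k+2
```

Every received unit is eventually presented, which is why the total presentation time can exceed the nominal one.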



4 Performance Evaluation 

The presentation quality of a DMP can be evaluated by some essential QoS 
parameters, namely the presentation jitter and skew. In order to assess Mcast's 
efficiency, we compare the performance of Mcast with that of the 
traditional client/server architecture based on these QoS parameters. 
Both Mcast and the client/server architecture are implemented according to 
our proposed synchronization and presentation schemes. The total number of 
presented frames is 4000 and the whole presentation contains two 
stages. For the sake of comparison, the number of pre-deposited audio/video frames 
at the virtual server is fixed at 20. An Mcast client pre-deposits 5 frames; 
a client in the traditional client/server architecture pre-deposits 
5 or 25 frames. The evaluation results, depicted in Figures 4 to 6, 
are arranged as follows. Cases (a) and (b) are the evaluation results for 
the traditional client/server architecture, with 5 and 25 pre-deposited frames, 
respectively. Case (c) shows the evaluation results of Mcast 
with 5 pre-deposited frames. We note that the number of pre-deposited frames in 
case (b) equals the sum of the pre-deposited frames of a virtual server and 
a client in case (c). 

Fig. 4. Presentation jitter of the audio and video streams based on the content-oriented 
scheme. Cases (a) and (b) are based on the traditional client/server architecture and 
case (c) is based on Mcast. 



Figure 4 depicts the presentation jitter of the audio and video streams at the 
client site based on the content-oriented scheme. With a longer pre-deposited 
length, the curve of case (b) is more stable and its presentation jitter is 
smaller than those of case (a). Although the number of pre-deposited frames in 
case (b) equals the total number of pre-deposited frames in case (c), the curve of 
case (b) oscillates more and its presentation jitter is larger than those of 
case (c). This is because the presentation overhead of clients in case (b) 
includes not only the media transmission overhead but also the synchronization 
overhead of compensating for the WAN jitter. On the other hand, clients in 
case (c) present media frames that have already been synchronized by the virtual 
server. Since the volume of the video stream is much larger than that of the audio 
stream, the presentation jitter of the video stream is larger than that of the audio 
stream. 

Figure 5 shows the presentation jitter based on the time-oriented scheme. We 
observe that the average presentation jitter based on the time-oriented scheme 
is greater than the average presentation jitter based on the content-oriented 
scheme. This is because more media frames are skipped under the time-oriented 
scheme in order to catch up with the nominal presentation. 

Figure 6 depicts the presentation skew of the audio and video streams at 
the client site based on the content-oriented and the time-oriented schemes. The 
curve of case (c) is more stable and its skew values are smaller than those of 
cases (a) and (b). We note that the average presentation skew based on the 
time-oriented scheme is less than that based on the content-oriented scheme. This 
is because, under the time-oriented scheme, each medium stream is presented 
according to the nominal presentation schedule as much as possible. Thus, the 
skew between media streams is reduced. 

Fig. 5. Presentation jitter of the audio and video streams based on the time-oriented 
scheme. Cases (a) and (b) are based on the traditional client/server architecture and 
case (c) is based on Mcast. 

Fig. 6. Presentation skew between the audio and video streams. Cases (a) and (b) 
are based on the traditional client/server architecture and case (c) is based on Mcast. 
The left evaluation results of cases (a), (b), and (c) are based on the content-oriented 
scheme; the right evaluation results of cases (a), (b), and (c) are based on the 
time-oriented scheme. 

5 Conclusion 

This paper describes the major considerations, resolutions, and techniques that 
are involved in designing and developing a multicast multiple-stream multimedia 
presentation system. Based on the design and development considerations and 
the proposed synchronization/presentation schemes, the Mcast communication 
engine has been implemented. Mcast provides (1) the authoring tool to specify 
a presentation's appearance, and (2) the temporal control mechanism to achieve 
smooth presentations. Evaluation results reveal that Mcast's performance is 
acceptable and that Mcast is more efficient than the traditional client/server 
architecture. 



Abstract. This paper deals with the problem of multimedia synchronisation in 
fully meshed multipoint videoconferencing applications. It builds on 
previous work on multimedia synchronisation in point to point 
videoconferencing. Up to now, the two main approaches have been based on (1) a 
timeline or (2) temporal intervals composition. However, the first approach is not 
suited to multimedia data, and the second is not suited to multipoint 
synchronisation. This paper therefore proposes a hybrid synchronisation solution 
that takes advantage of both approaches. A multipoint 
videoconferencing tool based on this hybrid solution, called Confort, has been 
designed, developed, implemented and evaluated. The paper also presents 
implementation details and evaluation measurements. 



1. Introduction 

Videoconferencing is nowadays a mandatory tool for many real time collaborative 
working environments. Many videoconferencing tools already exist, including 
industrial and commercial products such as NetMeeting from 
Microsoft, and academic tools such as VIC from LBL or Rendez-vous (the IVS 
successor) from INRIA. The market for videoconferencing tools is growing rapidly, 
pushed by strong needs in concurrent engineering (in industrial domains such as 
aeronautic or automotive manufacturing, electronic design, etc.). But the first 
projects dealing with concurrent engineering evaluated videoconferencing tools in 
actual experiments and showed that no collaborative working environment, and in 
particular no videoconferencing application, fulfils collaborative working 
requirements. 

In particular, the main problem with videoconferencing tools is their lack 
of QoS (Quality of Service) guarantees. Among all the QoS parameters, the most 
meaningful seem to be the audio quality, which can currently range in a single work 
session from correct to incomprehensible (while it has to be constant); the end to 
end delay, which has to be as short as possible to increase the interactivity level 
between users; and all temporal synchronisation constraints. In fact, users are 
really disturbed when the temporal features of the audio and video streams are not 
enforced, as voice can be hard or impossible to understand, and the lack of lip 
synchronisation makes the audio/video correlation disappear. Moreover, multimedia 
synchronisation is currently the key point to address, and the most difficult to 
solve, when designing multimedia systems [3]. 

In related work, multimedia synchronisation is most of the time not addressed, and 
when it is, the selected solution is based on timestamps. This is the case for all 
videoconferencing applications using RTP [12], such as VIC, VAT or Rendez-Vous, 
even though it is now well known that the semantics of timestamps does not match the 
requirements of multimedia synchronisation. Based on this observation, some work 
led to another synchronisation approach based on temporal intervals composition. 
This approach has been selected and improved at LAAS, and led to the design of a 
point to point videoconferencing tool: PNSVS (for Petri Net Synchronised 
Videoconferencing System). This paper deals with the extension of PNSVS toward 
a multipoint videoconferencing application enforcing multipoint multimedia 
synchronisation. This new videoconferencing tool is called “Confort”. 

The remainder of this paper is organised as follows. First, the problem of 
multimedia synchronisation is described (section 2), with a particular focus 
on the problem of synchronising multiple users. Then section 3 
describes the principles of the two main synchronisation approaches mentioned 
above. Given the analysis of these two approaches, and based on the 
experience gained from the PNSVS design and development, a hybrid 
synchronisation is presented (section 4). Using this hybrid approach, section 5 
describes the implementation principles of Confort, and section 6 gives some 
experimental results and measurements. Finally, section 7 concludes this paper. 



2. Synchronisation Problematics 

2.1. Multimedia Synchronisation 

The problem of multimedia synchronisation is closely related to the characteristics 
of the media to synchronise. In a videoconferencing tool, the media (audio and 
video) are dynamic and have strong temporal requirements. Multimedia 
synchronisation then consists in ensuring both the intra- and inter-stream temporal 
constraints. Ensuring the intra-stream temporal constraints consists in controlling 
the jitter on each audio or video object in order to keep it under an acceptable 
maximum value. It means that the temporal profile of media objects has to be 
respected on the receiving site. For instance, a 25 images/s video stream has to be 
played back at a rate of 25 images/s. The only tolerated deviation is called jitter. 

Ensuring the inter-stream synchronisation consists in keeping the drift 
between the audio and video streams (due to the cumulative effects of jitter) within 
a maximum value (for example, in lip synchronisation, where lip 
movement has to be related to the voice stream). 
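
To make the two constraints concrete, the sketch below measures jitter and drift from presentation timestamps. It is an illustration under our own assumptions: jitter is taken as the deviation of each inter-presentation gap from the nominal period, drift as the difference between the accumulated jitter of two streams, and the function names and sample timestamps are hypothetical. Only the 25 images/s rate (a 40 ms period) comes from the text.

```python
from typing import List, Sequence


def jitters(timestamps_ms: Sequence[int], period_ms: int) -> List[int]:
    """Deviation of each inter-presentation gap from the nominal period."""
    return [
        (later - earlier) - period_ms
        for earlier, later in zip(timestamps_ms, timestamps_ms[1:])
    ]


def intra_stream_ok(timestamps_ms: Sequence[int], period_ms: int,
                    max_jitter_ms: int) -> bool:
    """Intra-stream constraint: every jitter stays within the tolerance."""
    return all(abs(j) <= max_jitter_ms for j in jitters(timestamps_ms, period_ms))


def drift_ms(audio_ts: Sequence[int], audio_period_ms: int,
             video_ts: Sequence[int], video_period_ms: int) -> int:
    """Inter-stream drift: difference between the accumulated jitters."""
    return (sum(jitters(audio_ts, audio_period_ms))
            - sum(jitters(video_ts, video_period_ms)))


# A 25 images/s video stream has a nominal period of 40 ms.
video = [0, 42, 84, 126]   # each image is 2 ms late (hypothetical values)
audio = [0, 40, 80, 120]   # audio packets assumed to be 40 ms long, on time
video_jitters = jitters(video, 40)
accumulated_drift = drift_ms(audio, 40, video, 40)
```

Each image is only 2 ms late, yet the drift grows with every object, which is why drift has to be bounded separately from jitter.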



2.2. Multipoint Synchronisation 

Multipoint synchronisation can be seen as the synchronisation of multiple users. 
This synchronisation task aims to avoid semantic incoherence in the dialogue 
between all users. In fact, delays can change significantly in distributed systems, 
and a user must never receive the response to a question before the 
question itself. In collaborative systems, this kind of problem is generally solved 
using Lamport clocks, which create in the system a logical order that is based on 
causal dependencies and that has to be respected by all users. 
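
For readers unfamiliar with Lamport clocks, the question/response case can be sketched as follows. This is the textbook logical clock, shown only to illustrate the ordering property; the class and the three-user scenario are hypothetical and are not part of any tool discussed in this paper.

```python
class LamportClock:
    """Textbook Lamport logical clock for one user (process)."""

    def __init__(self) -> None:
        self.time = 0

    def send(self) -> int:
        """Stamp an outgoing message: advance the clock, return the stamp."""
        self.time += 1
        return self.time

    def receive(self, stamp: int) -> int:
        """Merge the stamp of an incoming message into the local clock."""
        self.time = max(self.time, stamp) + 1
        return self.time


alice, bob, carol = LamportClock(), LamportClock(), LamportClock()

question = alice.send()        # Alice asks a question
bob.receive(question)          # Bob hears it ...
answer = bob.send()            # ... and answers

# Carol's network delivers the answer first; the stamps still reveal
# that the question causally precedes the answer.
inbox = [("answer", answer), ("question", question)]
ordered = [name for name, stamp in sorted(inbox, key=lambda m: m[1])]
```

The stamps give an order but say nothing about how much real time separates the two messages.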

But in a videoconferencing application that handles dynamic media such as audio 
and video, which have strong temporal requirements, Lamport clocks seem to be quite 
limited. In fact, in a videoconferencing tool, interactivity is of major importance, 
and the semantics is closely related to delay and explicit time. It is generally 
acknowledged that, for a highly interactive dialogue, delays between users have to be 
less than 250 ms [6]. Beyond this threshold, audio and video data can be considered 
too old and of no relevance. Multipoint synchronisation therefore requires 
mechanisms that control the end to end delays between users, in order 
to ensure the proper order between audio and video information on a temporal basis. 



3. Related Approaches 

As mentioned in section 1, two main approaches exist in the literature. The 
first one, and the most used, relies on a timeline (an absolute time 
approach), while the second relies on the composition of time intervals (a 
relative time approach). 



3.1. “Timeline” Approach 

This approach, which consists in using timestamps, is easy and intuitive. Each 
audio or video presentation object carries its own presentation date, and the 
presentation process just has to display it at the right time. This method ensures 
both intra- and inter-stream synchronisation: a 25 images/s video sequence is 
replayed at a 25 images/s rate, respecting the temporal constraints on each object 
(intra-stream synchronisation), and inter-stream synchronisation is also ensured, 
as each stream synchronises itself on a common temporal reference timeline and the 
streams are thus synchronised with one another. 

However, timestamps do not solve the problem of acceptable jitter on information 
objects. A timestamp is only a date on a time axis, and is therefore unable to 
express the acceptable variations in the presentation of each object. This approach, 
then, does not provide a suitable solution to the problem of multimedia 
synchronisation. 



3.2. Temporal Intervals Composition 

The second approach frequently encountered in the literature is based on temporal 
intervals composition [1]. It means that the presentation time of one object depends 
on the completion of the preceding object's presentation. Thus, multimedia 
synchronisation scenarios can be created simply by composing the time 
durations of all presentation data units. This is the case, for instance, of the 
OCPN (Object Composition Petri Net) model [8]. 

To design guaranteed synchronised multimedia, it is mandatory to consider all 
problems due to the temporal variability of the computing times (jitter, drift) and 
to link them to the properties inherent to each multimedia object (for instance, to 
link the jitter of an operating system to the acceptable jitter of each multimedia 
object). To define these synchronisation properties on the multimedia objects 
themselves, a model of the application synchronisation constraints is required. 
Several studies have already been carried out in this domain, and some models have 
been proposed. In particular, the OCPN model only considers nominal computing times, 
and does not address the temporal jitter acceptable on most multimedia data. 

This limitation led [5] to propose a new model, called TSPN (Time Streams Petri 
Nets), which takes this temporal variability into account. By definition, TSPNs use 
temporal intervals on the arcs leaving places. The temporal intervals are triplets 
$(x_s, n_s, y_s)$ called validity time intervals, where $x_s$, $n_s$ and $y_s$ are 
respectively the minimum, nominal and maximum presentation values. 

The inter-stream temporal drifts can be controlled in a very precise way using 9 
different inter-stream transition semantics and the position of each time interval 
on the arcs leaving the places (allowing each stream to be processed independently 
of the others). Using these transition rules, it is possible to specify 
synchronisation mechanisms driven by the earliest stream ("or" synchronisation 
rules), the latest stream ("and" synchronisation rules) or by a given stream 
("master" synchronisation rules). For more details, [13] gives a formal definition 
of the model and the different firing semantics. 

Example: Using a TSPN for PNSVS 

This part uses the TSPN model to describe the set of timed 
synchronisation behaviours that appear in a videoconference application such as 
PNSVS [4]. 

Fig. 1. Videoconference synchronisation constraints: TSPN based modelling 

As an example. Fig. 1 describes the requirements for a videoconferencing system 
whose QoS parameters are: 

• A throughput of 10 images per second. Then the nominal presentation time of a 
video object is 100 ms; 

• An acceptable jitter on the audio or video of 10 ms (in absolute value) [6]. It
follows that the temporal validity intervals are [90, 100, 110] for one image and
one audio packet;

• The audio is the most important media: inter-stream synchronisation is of «and-master»
type, the voice being the master. In fact, voice being more important than
video, it is defined as the master stream and its temporal constraints have to
always be fulfilled; the «and» part of this rule tries as much as possible to wait and
respect the constraints on the video stream to reduce its discontinuities;

• The synchronisation quality: the inter-stream drift must be less than 100 ms [6],
100 ms being the limit under which the temporal gap between audio and video
cannot be noticed. The inter-stream synchronisation period then corresponds to the
presentation of 5 images, because the maximum drift on 5 objects is 50 ms
and the inter-stream drift is then less than 100 ms.
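The arithmetic behind these QoS parameters can be sketched as follows. This is a minimal illustration, not code from PNSVS or Confort: the function names are hypothetical, and the reading that a stream drifts by at most the jitter bound per object is an assumption drawn from the figures quoted above.

```python
def validity_interval(nominal_ms, jitter_ms):
    """Return the (minimum, nominal, maximum) presentation times of one object."""
    return (nominal_ms - jitter_ms, nominal_ms, nominal_ms + jitter_ms)


def worst_case_stream_drift(period_objects, jitter_ms):
    """Largest drift one stream can accumulate between two rendezvous points,
    assuming every object deviates by the full jitter bound in the same direction."""
    return period_objects * jitter_ms


frame_rate = 10                      # images per second (QoS parameter above)
nominal_ms = 1000 // frame_rate      # nominal presentation time of one object
interval = validity_interval(nominal_ms, jitter_ms=10)
drift_ms = worst_case_stream_drift(period_objects=5, jitter_ms=10)

print(interval)   # validity interval for one image or one audio packet
print(drift_ms)   # accumulated drift of one stream over a 5-object period
```

With a period of 5 objects and a 10 ms jitter bound, a single stream accumulates at most 50 ms, which matches the value quoted for the synchronisation period.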

As has been shown before, as well as in [11] for PNSVS, the temporal intervals
composition approach is perfectly suited for point to point multimedia
synchronisation. But this approach is not so powerful for multipoint synchronisation.
In fact, it is quite difficult with such an approach to synchronise all users and to
control the end to end delay, as no global clock is used and we do not assume any
temporal properties on the delays in the network. Each jitter on an object
introduces desynchronisation between users. As with Lamport clocks, TSPN only
allows logical synchronisation (or ordering) between users.

4. A Hybrid Solution for Multipoint Multimedia Synchronisation

On the one hand, timestamps are not suited for multimedia synchronisation, and time
intervals composition is not suited for multipoint synchronisation, so neither
approach can be used alone. On the other hand, time intervals composition is
perfectly suited to multimedia synchronisation, and timestamps are, until today, the
best way to achieve multipoint synchronisation and a global time in distributed
systems. It is then proposed in this paper to use a hybrid approach where:

• Multimedia synchronisation is achieved thanks to a time intervals composition 
mechanism; 

• Multipoint synchronisation is achieved thanks to the set up of a global clock in the 
distributed system. 

Concerning multimedia synchronisation in Confort, the principle is very similar to 
the one of PNSVS described in [4], as TSPN and more generally synchronisation 
mechanisms relying on temporal intervals composition are designed for 1 to N 
synchronisation schemas. Moreover, this approach has been designed for 
unpredictable distributed systems where delays, loss level, etc. are changing. In fact, 
the synchronisation constraints to enforce on each object depend on the presentation 
duration of the media object, and on the arrival date of the considered object. The 
synchronisation scenario modelled by the TSPN of Fig. 1 is refined, concerning actual 
presentation dates, each time a new media object reaches the receiving 
videoconference entity. In addition, as current general operating systems (such as
Unix, Windows, etc.) are asynchronous, performing temporal synchronisation on
media objects in low level layers is useless, as synchronisation features are disturbed 
when media objects cross communication and operating system layers. The 
synchronisation task has then to be put at the upper level of the receiving entity, in the 
application layer. As the synchronisation task is put on the receiving entity, 
synchronisation approaches based on temporal intervals are then ideally designed for 
1 to N synchronisation.

Given the TSPN of Fig. 1, the synchronisation task on each receiver has only to
implement a synchronisation runtime able to enforce the synchronisation constraints 
modelled by the TSPN, and to schedule media objects adequately. The principle of 
such a runtime is described in section 5, which deals with implementation issues. 

Concerning multipoint synchronisation and given a global synchronised clock over 
all the distributed system, it is then very easy to control end to end delay and causality 
in media object deliveries. In fact, it is sufficient to put in the header of media objects 
a timestamp (related to the global clock) and to control that the time between the 
sending and arrival dates is not greater than the maximum end to end delay required. 
This approach is very similar to the one of RTP (Real-time Transport Protocol), which is
recommended by the IETF for audio/video transport [2], except that we here assume a
global clock. How the clocks of distributed workstations are synchronised to obtain this
global clock is explained in section 5 (related to implementation issues).
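The delay control described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, both timestamps are assumed to be read from the same global clock, and the 250 ms bound is a made-up value, since the text does not state the maximum end to end delay used in Confort.

```python
def end_to_end_delay_ms(send_timestamp_ms, arrival_time_ms):
    """Delay of one media object; both values come from the global clock."""
    return arrival_time_ms - send_timestamp_ms


def respects_delay_bound(send_timestamp_ms, arrival_time_ms, max_delay_ms):
    """True if the object arrived within the required end to end delay."""
    return end_to_end_delay_ms(send_timestamp_ms, arrival_time_ms) <= max_delay_ms


MAX_DELAY_MS = 250  # hypothetical bound, not taken from the paper

on_time = respects_delay_bound(1_000, 1_180, MAX_DELAY_MS)
too_late = respects_delay_bound(1_000, 1_400, MAX_DELAY_MS)
print(on_time, too_late)
```

The check is only meaningful when sender and receiver clocks agree, which is why the approach depends on the global clock.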

Finally, with the temporal intervals composition based approach that allows 1 to N 
multimedia synchronisation schemas and the synchronisation of all the workstation 
clocks - and therefore of all the sending entities of the videoconferencing session - an 
N to N synchronisation schema is achieved. 



5. Implementation Issues 




Fig. 2. Confort architecture (3 participants)

This section deals with implementation issues of the hybrid synchronisation task
described in the preceding section. Thus, it is shown: 

- How multimedia synchronisation can be enforced with a strong respect of audio 
and video temporal constraints? 

- How the global clock can be set-up on a potentially wide area distributed system? 

- And how some other functionalities such as openness and auto-configurability can be
integrated in Confort, in order to make it rapidly and easily usable in any
collaborative working environment?

To illustrate all these implementation issues, Fig. 2 depicts the Confort software
architecture.

5.1. Principle 

An important constraint of the Confort development concerns its ability to work on
general systems such as Windows or Unix, with general hardware. We prefer to avoid
dedicated systems and hardware architectures that are not widely available, since a
videoconferencing tool has to work on as many desktop computers as possible.

However, implementing multimedia synchronisation mechanisms having to 
enforce temporal constraints requires some real time functionalities of the operating 
system, to be able to control the processes scheduling. For this reason, Solaris has 
been selected as it provides a real-time scheduling class, and it was the only one to 
provide this feature in 1994, when the work on videoconferencing began at LAAS. 
This real time scheduling class (also called RT) is really interesting as it gives RT 
priorities to user processes greater than the ones of system tasks, and also a 
predictable and configurable scheduler [7]. 

Before addressing the multipoint synchronisation point, some work has been done, 
at LAAS, to enforce synchronisation in the PNSVS point to point videoconferencing 
application [10]. And it has been said in section 4 that synchronisation mechanisms 
are located in the top layer of the receiving machine. Thus, the multimedia 
synchronisation problem can be solved in Confort as in PNSVS. In fact on each 1 to 
N multiconnection from one sender to all other participants of the videoconference 
session (the receivers), we can duplicate the synchronisation automaton of PNSVS on 
each receiving entity. 

Finally, as it is depicted on Fig. 2, each videoconferencing entity (on each 
machine) consists of: 

• A sending task that grabs video and audio objects and multicasts them to all the 
other videoconference participants; 

• For each distant participant, a synchronisation task that synchronises the audio and
video streams coming from this distant participant. Then with N participants, each
entity will run N-1 synchronisation tasks similar to the one of PNSVS.

As a reminder, the following recalls how the PNSVS intra and inter-streams
synchronisation mechanisms work. Each stream is processed by a
dedicated process (implemented as a thread). But before being presented (on
the screen for images or loudspeakers for audio), each object has to be processed by
specific hardware (video or audio board) managed by the kernel, with the system
scheduling class and priority. There is no guarantee on the time that the
processing of each image and audio object will take: this is a typical asynchronous
system. It is then required to create a second process (thread), in the RT class, whose
function is to control the temporal behaviour of the presentation process. Running
with the RT priority, the control process, also called orchestration process, can
fully control the associated presentation process. It can stop (or kill) it if it
overruns the maximum presentation duration, or it can make it sleep until the
minimal presentation time is reached: the intra-stream synchronisation constraints are
therefore enforced. Then, as depicted in Fig. 2, each synchronisation task for
synchronising audio and video streams consists of 4 threads:

- One audio presentation thread;

- One video presentation thread;

- One audio orchestration thread devoted to the control of the temporal
behaviour of the audio presentation thread;

- One video orchestration thread devoted to the control of the temporal
behaviour of the video presentation process.

On the other side, inter-streams synchronisation is performed between the 
orchestration threads, that have to synchronise themselves on a rendezvous point 
(corresponding to inter-streams transition on the TSPN model), and this with respect 
to the semantic rule of the inter-streams synchronisation. 
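The intra-stream control performed by an orchestration thread, as described earlier in this subsection, can be sketched as a pure decision function. This is a simplified illustration and not the PNSVS or Confort implementation: the function name, the action labels and the polling style are hypothetical, and the real-time scheduling itself is not modelled.

```python
def control_action(elapsed_ms, presentation_done, interval):
    """Decide what the orchestration thread does for one media object.

    elapsed_ms        -- time since the presentation of the object started
    presentation_done -- whether the presentation thread has finished
    interval          -- validity interval (minimum, nominal, maximum) in ms
    """
    x_min, _nominal, y_max = interval
    if presentation_done:
        if elapsed_ms < x_min:
            # Finished too early: sleep until the minimal presentation time.
            return ("sleep", x_min - elapsed_ms)
        return ("fire", 0)
    if elapsed_ms >= y_max:
        # Overrun: stop the presentation at the maximum presentation time.
        return ("stop", 0)
    # Still inside the interval: keep waiting, at most until y_max.
    return ("wait", y_max - elapsed_ms)


interval = (90, 100, 110)
print(control_action(70, True, interval))    # finished early
print(control_action(110, False, interval))  # overran the maximum
```

Either branch keeps the effective presentation duration inside the validity interval, which is the intra-stream constraint.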



5.2. Clocks Synchronisation 

Concerning clock synchronisation, my first work consisted in investigating
existing solutions. Many are reported in the literature, and they are always based on
exchanges of temporal information over communication networks. But it also appears
that none of these solutions provides any guarantee on the precision it can reach
on a general network such as the Internet. Even NTP3 [9], which has been designed for
synchronising clocks over the Internet, does not provide any guarantee, the
synchronisation precision being dependent on the network load and RTT, for
example. Nevertheless, on LANs, NTP proved to be really efficient, as a LAN provides
large bandwidth and low RTT. NTP is then a suitable solution for synchronising clocks
on a LAN and has therefore been selected. This choice is one among many similar
solutions that are as efficient as NTP. The advantage of NTP compared to other
solutions is its availability: many free implementations are available, and this protocol
is already widely deployed.
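The dependence of NTP precision on the RTT can be seen in the standard NTP offset and delay computation, sketched below. This is an illustration only, not code from Confort; the function and variable names are hypothetical, and the sample timestamps are invented.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP estimate from one request/response exchange.

    t0 -- client clock when the request is sent
    t1 -- server clock when the request is received
    t2 -- server clock when the response is sent
    t3 -- client clock when the response is received
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2   # estimated server minus client clock
    delay = (t3 - t0) - (t2 - t1)          # round trip time on the network
    return offset, delay


# Invented example (ms): the client is 5 ms behind the server, each network
# direction takes 2 ms and the server needs 1 ms to answer.
offset, delay = ntp_offset_and_delay(t0=100, t1=107, t2=108, t3=105)
print(offset, delay)
```

The offset estimate is exact only when both directions take the same time; otherwise the error can reach half the round trip delay, which is why a low and stable RTT, as on a LAN, gives a tighter synchronisation.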

Nevertheless, it remains to solve the problem of synchronising machines on a WAN.
To my knowledge, there is no clock synchronisation protocol that is really efficient on a
WAN such as the Internet. The selected solution is then the one offered by GPS, which
transmits the universal time of reference atomic clocks by satellite links, with a
precision of a few microseconds. Thus, the solution retained consists in setting one GPS
board on each LAN, and then synchronising the clocks of all the machines
connected to this LAN with the clock of the GPS board, using NTP. Similar solutions
are sometimes used for scalable and reliable multicast protocols. Thanks to this
approach it is possible to synchronise very precisely the clocks of partners involved in
a widely distributed videoconferencing session. Section 6 presents some evaluation
measurements of this solution.

5.3. Additional Functionalities of Confort

The other functionalities that have to be integrated in Confort are not related to 
synchronisation. The point here concerns the ability of the videoconferencing tool to 
be easily integrated in a more general and generic tele-working environment. These 
kinds of environments may or may not manage the work session membership, and 
may or may not send control and/or command messages. Depending on the work 
session control tool of the collaborative environment (maybe a workflow management 
system), the videoconferencing system has to: 

• Be able to receive control and/or command messages from the work session 
controller, and then to dynamically change the group of participants (by adding or 
removing some members), change the requested QoS on audio and video streams, 
etc. This ability is called “openness”. For that an interface has been designed, that 
defines the type of control/command messages, and fields they contain; 

• Be able to detect membership changes by itself. This ability is called “auto-configurability”.
If the videoconferencing receiving entity detects that new
audio and video streams arrive from a new participant (who was not participating
in the session up to now), it automatically allocates the required resources
(buffers), starts the synchronisation entity (consisting of 4 threads), and creates a
video window to display the received images. Conversely, if the receiving
entity does not receive any data from one participant for a long time, this
participant is removed from the active members, the allocated resources are freed,
and the associated synchronisation threads are killed. This principle allows the
application to always have a coherent knowledge of the state of the participants group
membership, even in case of failures or crashes.

The architecture principle to develop such functionalities relies on the use of a
stocker/dispatcher thread (Fig. 2). This thread is the master of the overall architecture
as, depending on the type of data it receives, it can:

• Store audio and video data in the right buffer, depending on who sent it;

• Change what has to be changed if it receives a control/command message
(openness);

• Adapt the dynamic architecture of the videoconferencing entity by itself if it notices
that a new participant is sending data or, conversely, that a participant no
longer sends data (auto-configurability).
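The three roles of the stocker/dispatcher listed above can be sketched as follows. This is a simplified, single-threaded illustration and not the Confort implementation (which is written in C++): the class name, method names, command strings and timeout value are all hypothetical, and a plain list stands in for the buffers, the 4 synchronisation threads and the video window allocated per participant.

```python
class StockerDispatcher:
    """Toy model of per-participant buffers with openness and auto-configurability."""

    def __init__(self, silence_timeout_s):
        self.silence_timeout_s = silence_timeout_s
        self.buffers = {}      # participant -> list of received media objects
        self.last_seen = {}    # participant -> time of last received data

    def _add(self, participant, now):
        # Stands in for allocating buffers and starting the synchronisation task.
        self.buffers.setdefault(participant, [])
        self.last_seen[participant] = now

    def _remove(self, participant):
        # Stands in for freeing resources and killing the synchronisation threads.
        self.buffers.pop(participant, None)
        self.last_seen.pop(participant, None)

    def on_media(self, sender, media_object, now):
        """Store data in the sender's buffer; unknown senders are added on the fly."""
        self._add(sender, now)
        self.buffers[sender].append(media_object)

    def on_control(self, command, participant, now):
        """Openness: membership changes requested by a session controller."""
        if command == "add":
            self._add(participant, now)
        elif command == "remove":
            self._remove(participant)

    def expire_silent(self, now):
        """Auto-configurability: drop participants that have been silent too long."""
        silent = [p for p, t in self.last_seen.items()
                  if now - t > self.silence_timeout_s]
        for participant in silent:
            self._remove(participant)
        return silent


dispatcher = StockerDispatcher(silence_timeout_s=5.0)
dispatcher.on_media("alice", b"frame-1", now=0.0)
dispatcher.on_media("bob", b"frame-1", now=4.0)
print(dispatcher.expire_silent(now=6.0))   # participants removed for silence
```

Passing the current time explicitly keeps the sketch deterministic; a real entity would read it from the synchronised clock.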

6. Results 

Confort has been implemented on Sun workstations (Sun Sparc Station or Ultra 
station), in C++, using the Solaris 2.5.1 or 2.7 operating system. Confort supports 
Parallax or SunVideo video boards that allow the system to grab, display, compress 
and uncompress images using the M-JPEG algorithm. Audio is the Sun standard 
audio system. Workstations running Confort communicate using the UDP transport 
protocol. Tests have been performed on top of a 10 Mbps Ethernet and a 155 Mbps 
ATM network. 

Confort is a full meshed and full duplex videoconferencing application that can
process up to 25 images/s (320 x 240 pixels and 24 bits coded colours).

This part aims to evaluate the performance of the hybrid multipoint
synchronisation task developed in Confort. First it is required to verify whether the
synchronisation solutions respect the temporal presentation requirements, and in
particular the quality required for the intra and inter-stream synchronisation. The
measurements reported here have been made in the case of a 10 images/s
videoconference application whose synchronisation requirements were modelled by
the presentation TSPN given in Fig. 1, and with a 3-user videoconferencing session.

Fig. 3 shows, for the first 100 images received by one of the users from another,
the jitter that appears during the presentation of each image. The measured jitter is
the difference between the effective presentation duration and the nominal
presentation time. This figure clearly shows that the maximum jitter value is always
respected. Note that the jitter is always negative: this is because in this experiment
there are no network problems (no loss and no jitter, the goal being to evaluate the
temporal synchronisation mechanisms) and so the needed data are available as soon
as the presentation process needs them. Thus, the anticipation mechanism of the
temporal intervals composition approach, which stops the presentation of an object as
soon as its minimal presentation time has been reached, actually works. The very
small variations are due to the real-time scheduler of Solaris 2.

Fig. 4 shows the same measurements but now for the Confort audio stream. As in 
the video case, intra-stream synchronisation requirements are always fulfilled. Note 
that now the jitters are always positive. In fact, this is because firing the intra-stream 
transition is controlled by the audio driver signal. With a time scale expressed in 
millisecond, the presentation duration of one audio object is always equal to 100 ms; 
variations are due to the time required by the system to take into account this 
information.

Finally, Fig. 5 shows the curve representing the inter-stream drift for the audio and
video sequences. Fig. 3 shows that the video jitter is always negative and Fig. 4 that
the audio jitter is always positive. Thus, the inter-stream drift (the difference between
the audio and video presentation dates) is positive. It can be easily seen that this drift
increases during each period of 5 objects (which is due to the cumulative effect of the
jitter) and is then reduced to a value near zero at each inter-stream synchronisation.
It clearly follows that the inter-stream synchronisation requirements are
fulfilled: the maximum drift never exceeds 50 ms, the maximum allowed value being
100 ms. Note that the maximum value is never reached here because the audio jitter
is always very close to zero and thus the inter-stream drift is essentially due to the
video jitter.
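The sawtooth behaviour of the drift described above can be reproduced with a small simulation. This is an illustrative sketch only: the function name is hypothetical, the jitter values are invented worst-case figures rather than the measurements behind Fig. 5, and the rendezvous is modelled as resetting the drift exactly to zero.

```python
def inter_stream_drift(audio_jitters_ms, video_jitters_ms, period):
    """Drift (audio minus video presentation date) after each object.

    The drift accumulates object after object and is reset to zero at each
    inter-stream rendezvous, i.e. every `period` objects.
    """
    drifts = []
    drift = 0
    pairs = zip(audio_jitters_ms, video_jitters_ms)
    for index, (audio_jitter, video_jitter) in enumerate(pairs, start=1):
        drift += audio_jitter - video_jitter
        drifts.append(drift)
        if index % period == 0:
            drift = 0  # rendezvous of the two orchestration threads
    return drifts


# Invented worst case: audio exactly on time, every image 10 ms early.
drifts = inter_stream_drift([0] * 10, [-10] * 10, period=5)
print(drifts)
print(max(drifts))
```

Even in this worst case the drift peaks at 50 ms just before each rendezvous, which is consistent with the bound reported above.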

Therefore, the evaluation of multimedia synchronisation mechanisms proves that 
they perfectly work. It remains, nevertheless, to evaluate the mechanisms allowing the 
synchronisation of the clocks of all the workstations involved in the 
videoconferencing system. These measurements have been performed on a dedicated 
155 Mbps ATM local network, on which delays are perfectly known (at a 1 ms scale), 
and thanks to a network analyser. It appears that the desynchronisation between the 
three workstations clocks is always less than 2 ms, and this drift between clocks is 
essentially due to temporal variations in the transport layer, supported by a non real 
time operating system, and is not due to low level network layers. NTP, then,
provides a suitable solution for synchronising workstation clocks on a LAN. In fact, at
the scale of a videoconferencing application, such a drift between clocks is not 
significant, and human users cannot notice an end to end delay variation of 2 ms. 

Nevertheless, Confort has not been tested on a general wide area network such as the
Internet (it has only been tested on a national private network: SAFIR), and so we
did not perform any measurement in such a case. But GPS manufacturers state that
the precision of synchronisation with GPS boards (connected to reference atomic
clocks by satellite links) is of a few microseconds. It is then of no significance
compared to the precision of NTP on a LAN, and compared to the requirement of a
videoconferencing application.



7. Conclusion 

This paper has presented a new hybrid approach for synchronising multimedia 
streams in a multipoint videoconferencing application. The multimedia 
synchronisation relies on a temporal intervals composition basis to take into account 
the jitters that are acceptable on audio and video objects. Then, the interactivity and 
the coherence of multimedia information between distributed participants is achieved 
thanks to a clocks synchronisation mechanism. Taking advantage of a global 
distributed clock, it is then possible to perfectly control the end to end delay and its 
variations, and to enforce multipoint synchronisation. The solution for synchronising
distributed clocks is very practical as it relies on dedicated hardware (GPS) and NTP. 
Nevertheless, it has been proved that the hybrid solution proposed in this paper is 
really efficient. 

The limitation that appears is related to the portability of the Confort
videoconferencing tool and the management of its evolutions. In fact, this tool is
written in C++ because it was required to access the real time functionalities of
Solaris. But each time the video board or the video driver version changes, it is
required to reopen the source code and to modify the video procedures. To avoid
such portability and maintenance problems, it would be interesting to use generic
video interfaces such as the Java JMF, for instance, that addresses the compatibility and
portability problems of video programs. But, today, the mapping between the Java
Virtual Machine (JVM) and the real time services of the operating system is not done,
the JVM having a really asynchronous behaviour. Nevertheless, the Java approach
seems really interesting for portability and version maintenance. That is why the next step
will consist in studying the mapping between the Java runtime and the operating
system kernel services.

Abstract. Future multimedia applications will evolve to content-rich,
interactive presentations consisting of an ensemble of concurrent flows related
to the presentation scenario. Recent research highlights the importance
of co-ordinating adaptation decisions among participating flows in
order to share common congestion control state. We exploit models that
quantify the effects of the dynamics of hierarchically encoded multimedia
content on perceived quality and present a mechanism to apportion
the session’s aggregate bandwidth among its streams that improves the
total session quality. Dynamic bandwidth utility curves are introduced
to express the variability of multimedia content and represent the level of
quality (or satisfaction) an application/user receives under given bandwidth
allocations. The relative importance of the participating flows,
determined either by the user or the application scenario, is also considered.
We discuss our approach and analyse results obtained
from trace-driven simulation.



1 Introduction 

Recently, the Internet has seen an explosive growth of real-time traffic. IP QoS
mechanisms, particularly the IETF diff-serv and int-serv initiatives, aim to provide
the network infrastructure with the necessary mechanisms to support the
delivery of streamed multimedia. This, together with the advent of access technologies
like ADSL and cable modems and the continuous expansion of the
backbone networks, allows delivery of higher volume and richer content multimedia,
since access rates of the order of Mbps are feasible. Traditional forms
of multimedia collaboration, such as conferencing and audio/video streaming,
are soon to be followed by more interactive applications: Internet TV, network
games, tele-presence and complex immersive environments.

In future multimedia applications, the user will be presented with a collection
of concurrent multimedia flows: several camera views from a single sports event,
multiple video feeds from a virtual classroom. The user will also be able to
prioritise incoming flows according to content or willingness to pay.

However, the proliferation of unrestricted real-time traffic over the Internet
is threatening network stability [7]; real-time media need to be able to adapt to
changing network conditions and share bandwidth fairly with other well-behaved 
traffic (TCP). In multi-flow applications, a number of related, concurrent flows 
will exist between a pair of hosts, and probably will experience highly correlated 
levels of service (delay, loss), if they share the same bottleneck links. We therefore 
believe that congestion adaptation decisions should consider the session’s streams
as a group, rather than individually. If concurrent flows exploit the shared
state information about service levels, and co-ordinate their reactions, efficient 
multiplexing and bandwidth sharing can be achieved 0- 0- 

Encoded multimedia content (especially video) is characterised by widely 
variable resource consumption, primarily owing to changes in the spatial and 
temporal domains of the original sequence. The encoding algorithm, the encoding 
parameters used, and the image size exacerbate the variability. This means that 
under a fixed bandwidth allocation, the usability of the flow fluctuates. If the 
impact on perceived quality can be measured, then an inter-stream resource 
allocation mechanism can apportion the bandwidth appropriately. 

This paper addresses the problem of delivering multiple concurrent multimedia
streams from a single source to unicast heterogeneous receivers. We prioritise
individual flows according to their time-varying usability, measured using utility
curves and quality profiles, and aim to maximise the quality of the session as a
whole. Utility curves express the quality a user or application is getting under
different resource allocations. We assume that the utility of a stream can be described
by a quality index, which maps transmission rate to a value. In this way,
we differ from the model where sources adapt transmission rate to congestion
without considering the resulting quality of each stream.

This paper is organised as follows: Section 2 introduces the notion of utility
curves and quality profiles, and presents relevant work. Section 3 describes our
approach for allocating the session bandwidth among participating flows. Section 4
presents the simulation results obtained. In Section 5, we identify applications
to which our work can be applied. Section 6 concludes the paper.

2 Utility Curves and Related Work 

2.1 Utility Curves 

Research in network pricing and resource allocation H3 extensively uses 
the notion of utility curves (or functions) to investigate methods for the optimal 
pricing of network resources. Utility functions map the quality (or satisfaction) 
that a service offers, when allocated a certain amount of resources, to a real 
number (usually in [0, 1]). So, if $U$ is a utility function and $x \prec y$ (i.e., allocation
$y$ is preferred to $x$), then $U(x) < U(y)$, which means $U$ is an increasing function.

In a networking scenario, utility functions may depend on many parameters:
available resources, encoding scheme, encoding parameters (e.g., image
size, frame rate, quantiser value), user preferences and the content of the media
stream. However, trying to devise a solution based on all of these variables is
likely to be intractable. In networked applications, bandwidth is the most important
resource, and we therefore limit our consideration to bandwidth utility
functions.

Bandwidth utility curves can be used to capture the adaptive nature of multimedia
applications; these applications can operate over a range of resource
availability. Effectively, a bandwidth utility curve is a mapping from the available
network resource used by the application/flow to a value that represents
the level of satisfaction (or dissatisfaction) of the application/user, at any time.
This satisfaction index can be obtained using objective or subjective methods for
assessing quality. Objective methods use specific signal metrics to assess quality,
such as PSNR (Peak Signal-to-Noise Ratio). These metrics, while easy to obtain,
often fail to correlate with the perception properties of the human audio-visual
system. Quality assessment models based on subjective measurements provide
more accurate results, but are more difficult to obtain. A widely used model is
MOS (Mean Opinion Score) [12], where the perceived quality is usually rated
on a 1 to 5 scale.

Utility functions for adaptive flows can be obtained by using MOS score
tables, and interpolating between values to produce piecewise linear utility functions
(Fig. 1 (left)). While a model based on subjective quality assessment is more
reliable, it is difficult to generate the required results in real-time. Currently, 
there is significant effort being put into producing objective quality assessment 
models that exploit the properties of human perception (see i, eg, m and 
their references), but, especially for video, they are too computationally intensive to
run in real time. These models produce output that most of the time is highly
correlated with the results of subjective quality tests.
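A piecewise linear utility function built from a MOS table, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the sample MOS table are hypothetical (the values are invented, not measured), and rescaling the 1 to 5 MOS range to [0, 1] is one possible normalisation, not necessarily the one used by the authors.

```python
from bisect import bisect_right


def mos_utility(points, bandwidth_kbps):
    """Piecewise linear utility in [0, 1] interpolated from a MOS table.

    points -- list of (bandwidth_kbps, mos) pairs sorted by bandwidth
    """
    rates = [rate for rate, _ in points]
    if bandwidth_kbps <= rates[0]:
        mos = points[0][1]
    elif bandwidth_kbps >= rates[-1]:
        mos = points[-1][1]
    else:
        # Locate the segment that contains the requested bandwidth.
        i = bisect_right(rates, bandwidth_kbps)
        (r0, m0), (r1, m1) = points[i - 1], points[i]
        mos = m0 + (m1 - m0) * (bandwidth_kbps - r0) / (r1 - r0)
    return (mos - 1) / 4  # map the MOS scale 1..5 onto [0, 1]


# Hypothetical MOS table for one stream.
table = [(64, 1.0), (128, 2.0), (256, 3.0), (512, 4.0), (1024, 5.0)]
print(mos_utility(table, 192))
```

Below the first table entry and above the last one the curve is held constant, so the function is non-decreasing as long as the MOS values in the table are.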

Min-Max is a special case of a linear utility function, where an application 
only operates between a minimum and maximum of resources. When the values 
a resource can take are discrete (or the application is adaptive in discrete steps), 
then utility curves have a scalar shape. For example, a discrete quality curve 
is produced when we map the number of received layers from a layered stream 
to the stream’s quality. In Fig. 1 (left), some commonly used utility curves are
depicted. Exponentially-decay functions (Fig. 1 (left)) are introduced in ,6j, (El
to describe bandwidth utility curves for adaptive applications. 

2.2 Dynamic Generation of Bandwidth Utility Curves 

Statically defined utility functions do not accurately model the burstiness of encoded
video. A mechanism that inspects video on-the-fly is needed. In m, a
framework for dynamic generation of utility curves is presented. A utility generator
that uses machine-learning techniques is employed to classify the utility of
an object into classes of utility curves. The range of utility curves is associated
with scaling profiles that describe scaling actions; these describe the adaptation
that occurs for given network conditions. Scaling actions could be changing the
quantiser, dropping frames, changing the colour depth, neglecting coefficients or
dynamic rate shaping. The utility profiles and the corresponding scaling profiles
are then transmitted; scaling actions can be taken at intermediary or boundary
nodes (e.g., at the boundaries between a wired and a wireless network or base
station). Similar work is presented in a, a, focused on MPEG-4 video.

Fig. 1. Common utility curves (left) and hypothetical audio and two video curves
(right)

Fig. 2. Correlation of SNR and utility metric used (panels: SNR and utility for the salesman sequence, H.261, 25 fps)

We use a model similar to ESI, and gather utility profiles off-line by assessing 
pre-encoded streams. The utility metric that is used is based on signal distortion, 
and is $U(r) = 1 - \mathrm{err}^2 / \mathrm{signal}^2$ (so, utility values are in the [0, 1] range), where
$\mathrm{err}^2$ and $\mathrm{signal}^2$ are the mean square error and mean square signal respectively.
The resource r represents the cumulative number of layers that the stream under 
consideration was encoded with. While a U value is calculated for every video 
frame, there is no reason to perform the inter-stream allocation on a per-frame 
basis, as this is a very fast time scale. Fig. 2 depicts the correspondence between
this metric and the signal-to-noise ratio (SNR) for a video sequence (salesman), 
and shows their correlation. Such utility values are biased by the way the frame 
is encoded (i.e., if it was I, P or B-frame) or by very fast (few ms) scene changes. 
We want to eliminate spikes and very short fluctuations in the utility curves, to 
avoid misinterpretations when the allocation process runs. In this way we try 
to track more meaningful longer-term changes in the content, where changing 
the number of transmitted layers will have the desirable effect. We use an exponentially
weighted moving average $U_{avg} = \alpha \cdot U_{avg} + (1 - \alpha) \cdot u_{cur}$ to smooth
the utility curve, where $u_{cur}$ is the current frame’s utility value, and $\alpha$ is the
averaging weight parameter. Fig.0(left) depicts the instantaneous utility values 
derived from a video sequence, and its associated moving average approximation. 
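
The metric and the smoothing step can be illustrated with a minimal Python sketch.
It is not the implementation used for the experiments: the function names, the
representation of a frame as a flat sequence of sample values, and seeding the
average with the first sample are all assumptions made for illustration.

```python
def frame_utility(original, decoded):
    """U = 1 - (mean square error) / (mean square signal) for one frame.

    Both arguments are equal-length sequences of sample values (an assumed
    representation); an all-zero original frame would divide by zero.
    """
    n = len(original)
    mse = sum((o - d) ** 2 for o, d in zip(original, decoded)) / n
    mss = sum(o * o for o in original) / n
    return 1.0 - mse / mss


def smooth(utilities, a=0.9):
    """Exponentially weighted moving average: U_avg = a*U_avg + (1 - a)*u_cur."""
    avg = utilities[0]  # assumption: the average is seeded with the first sample
    out = []
    for u_cur in utilities:
        avg = a * avg + (1.0 - a) * u_cur
        out.append(avg)
    return out
```

A larger a gives a smoother curve that reacts more slowly to content changes; a
smaller a follows the per-frame values more closely.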

2.3 Session Bandwidth Sharing and Related Work

SCUBA [?] is a protocol for expressing receivers’ interest in the streams of a
multicast session. Each receiver’s preference for an individual flow is scalably
multicast as a weight. Session bandwidth is allocated among the various streams
according to the averaged preferences of the group as a whole. The cumulative
transmission bit-rate for stream i with weight $w_i$ is then $w_i \cdot B_{session}$. The
protocol also determines a layer mapping, which is effectively the assignment of
transmission priorities to each layer of every stream. Receivers join layers of
higher priority first.

Youssef et al. [?] present a model to perform inter-stream adaptation, where
multicast senders adjust their transmission rates (or layers sent) according to
feedback reports obtained from receivers. A QoS manager and an inter-stream
adaptation module are used by the receiver to dynamically select the operating
status of the received streams, according to network parameters such as delay,
loss and jitter. Feedback is scalably multicast to the group, and enables senders
to detect the highest rate (or layer) any receiver is expecting for the encoded
stream. The transmission rate (or number of transmitted layers) can then be
adjusted at the sender so that resources are not wasted in sending information
(layers) the receivers cannot receive or handle. Bandwidth is also allocated based
on the dynamically changing priorities of streams.

The work presented in [?] describes a framework for the construction of
end-to-end congestion control that allows an ensemble of flows destined for
the same receiver to share common congestion information and make collective
adaptation decisions. The framework provides applications with an API that
informs them of the aggregate fair share of bandwidth for a collection of flows.
A hierarchical round-robin scheduler is used to apportion bandwidth among flows
based on weights (either pre-configured, or derived from receiver hints).

Assigning bandwidth based only on preference weights does not guarantee
that the receiver gets the best quality; it assumes that the resulting quality
each flow offers to an application is proportional to the resource allocated to it.
For reasons explained earlier, this is not always the case, so this method may
give sub-optimal results. Consider Fig. 1 (right), which depicts three instantaneous
utility curves, for one audio and two video streams (such utility curves
correspond to adaptive applications [?]). Assume that the three streams have
preference weights of 0.4, 0.2 and 0.4 respectively, and that the probed fair-share
bandwidth of the session is 640 Kbps. Then a weight-proportional allocation
would select the points (A, B, C) shown in Fig. 1 (right). However, we can
obviously do better. By simply moving 128 Kbps from the audio stream to the
H.263 video stream (points A’, B’), there is a noticeable improvement in
the utility of the H.263 video stream, while the audio quality is not substantially
degraded. This is because the usability of a flow depends on the nature of the
media (audio, video), the media content and the encoding scheme used, so a
proportional sharing of the session bandwidth does not always suffice.
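
The arithmetic of this example can be checked with a short Python sketch. The
helper names are hypothetical, and the index chosen for the H.263 stream is an
assumption, since the text does not say which of the two video streams it is.

```python
def proportional_shares(weights, session_kbps):
    """Split the session bandwidth in proportion to the preference weights."""
    return [w * session_kbps for w in weights]


def move(shares, src, dst, kbps):
    """Return a copy of `shares` with `kbps` moved from stream `src` to `dst`."""
    out = list(shares)
    out[src] -= kbps
    out[dst] += kbps
    return out


# Weights for the audio stream and the two video streams, 640 Kbps session.
shares = proportional_shares([0.4, 0.2, 0.4], 640)  # about 256, 128 and 256 Kbps
# Assumption: index 1 stands for the H.263 video stream.
adjusted = move(shares, src=0, dst=1, kbps=128)     # about 128, 256 and 256 Kbps
```

The total stays at 640 Kbps; only the split changes, which is why the gain depends
entirely on the shape of the individual utility curves.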

3 Weighted Utility-Based Session Bandwidth Allocation

In this section we present our approach to utility-based inter-stream session
bandwidth sharing. We assume the transmission of hierarchically encoded streams
to cope with receiver heterogeneity [?]. Layered coding features the decomposition
and encoding of the original signal into a number of cumulative
signals (layers). Popular streaming applications (like RealPlayer and Windows
Media Player) transmit a single stream pre-encoded at a certain bit-rate (for
example, 56 or 128 Kbps), and the appropriate version of the stream is chosen to
match the receiver’s capabilities. This implies that many different versions of the
same stream need to be stored at the content server. In contrast, with layered
coding, only one multi-layered encoded version of the stream is needed.

We base our mechanism on the existence of a companion congestion manager
module (like [13]) that provides information about the aggregate bandwidth
available to the session. This bandwidth may be reserved, or may be a fair share
over a best-effort network path (TCP-friendly, or any other fairness criterion).




Fig. 3. Smoothing of a utility curve (left), and utility curves and concavity (right) 



3.1 Algorithm to Determine Transmitted Layers 

Assume that the application consists of N different layered streams. With
layered flows, there are discrete operating points for each stream. For each
stream, a quality profile exists in the form of tuples $(t, l, U(t, l))$, where t is the
time index of the utility value (i.e., frame or group-of-pictures number), l the
number of layers the stream is encoded into and U the corresponding utility
value. Each stream i is assigned a preference weight $w_i$, with $\sum_{i=1}^{N} w_i = 1$. Denote
by $B_{aggregate}$ the aggregate session bandwidth, provided by the companion
congestion manager module. The aim of the allocation is to maximise the total
session utility $\sum_{i=1}^{N} w_i \cdot U_i(c_i)$, subject to $\sum_{i=1}^{N} R_i(c_i) \le B_{aggregate}$, where
$c_i = 0, 1, \ldots$ denotes the number of allocated layers for stream i, and $R_i$ is the
cumulative bandwidth requirement for $c_i$ layers of stream i. This is an instance
of the general knapsack problem.

Rajkumar et al. [?] apportion a resource to multiple contending applications
with one or multiple QoS parameters to maximise the overall utility. The
algorithm works optimally if the utility functions are increasing and concave.
In the case of non-concavity, a greedy algorithm is presented that performs
near-optimal allocation. Our approach is based on this greedy algorithm, as the
concavity criterion cannot always be met. This is shown in Fig. 3 (right), which
depicts a snapshot of real utility curves from the video sequences used in our
experiments (Table 1). The algorithm operates as follows:

1. Assign the initial constraints (footnote 1). The available bandwidth is then
$B_{avail} = B_{aggregate} - \sum_{i=1}^{N} R_i(c_{min})$, where $R_i(c_{min})$ is the bandwidth
requirement corresponding to the minimum constraint.

2. Calculate the weighted slopes, $w_i \cdot \frac{U_i(l) - U_i(c_i)}{R_i(l) - R_i(c_i)}$, for every layer l
above the currently allocated layer $c_i$. It is clear that in the case of
non-strictly concave utility curves, the calculated slopes may span more than
one of the discrete operating points of the flow under consideration (shown as
dashed lines in Fig. 3 (right)).

3. Find the maximum slope(s) among all unallocated layers, under the constraint
that the available bandwidth can satisfy the corresponding assignment.

4. Repeatedly increase the bandwidth for those streams found with maximum
slope until all layers are allocated, or the available bandwidth cannot support
any further allocation (exit condition). For each layer allocated, subtract its
resource requirement from $B_{avail}$.
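
The four steps above can be sketched in Python as follows. This is an illustrative
reading of the steps, not the authors' code: the function name and the data layout
(`rates[i][c]` and `utils[i][c]` hold the cumulative bandwidth and the utility of
stream i with c layers, index 0 meaning no layers) are assumptions, and cumulative
rates are assumed to increase strictly with the number of layers.

```python
def greedy_allocate(weights, rates, utils, b_aggregate, c_min=None):
    """Greedy layer allocation by maximum weighted utility slope.

    Returns (alloc, avail): layers allocated per stream and leftover bandwidth.
    """
    n = len(weights)
    # Step 1: start from the minimum constraints (default: no layers).
    alloc = list(c_min) if c_min else [0] * n
    avail = b_aggregate - sum(rates[i][alloc[i]] for i in range(n))
    if avail < 0:
        raise ValueError("minimum constraints exceed the aggregate bandwidth")
    while True:
        best = None  # (slope, stream, target layer count, extra bandwidth)
        for i in range(n):
            # Step 2: slopes to every layer above the current one, so a jump
            # over several operating points is possible on non-concave curves.
            for l in range(alloc[i] + 1, len(rates[i])):
                extra = rates[i][l] - rates[i][alloc[i]]
                if extra > avail:
                    break  # higher layers need even more bandwidth
                slope = weights[i] * (utils[i][l] - utils[i][alloc[i]]) / extra
                # Step 3: keep the steepest affordable slope.
                if best is None or slope > best[0]:
                    best = (slope, i, l, extra)
        if best is None:
            return alloc, avail  # exit: nothing left, or nothing fits
        # Step 4: allocate and subtract the requirement from the budget.
        _, i, l, extra = best
        alloc[i] = l
        avail -= extra
```

For example, with two equally weighted streams, `rates = [[0, 64, 128], [0, 64, 128]]`,
`utils = [[0, 0.8, 0.9], [0, 0.3, 0.95]]` and 192 Kbps, the sketch gives one layer to
the first stream and jumps the second stream straight to two layers, because its
utility curve is not concave.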

3.2 Time-Scale Issues 

Users prefer non-fluctuating quality, so it is important in the transmission of
layered media that the number of layers does not change often. This means
that the bandwidth-sharing algorithm should only run at sufficiently spaced
intervals. On the other hand, reducing the frequency of the allocation may
result in low responsiveness of the algorithm, as sampling utility curves at low
frequency is not meaningful.

The period of inter-stream allocation also reflects the frequency with which
an accompanying congestion manager module is queried to indicate the level of
aggregate network availability for the ensemble of flows. Such a network probing
interval (hundreds of ms or even seconds) is significantly larger than the (video)
inter-frame interval. Usually, in order to reflect meaningful network changes,
it should be in the order of a few RTTs. Determining an optimal period for
inter-stream allocation is not a trivial problem, so we set it as an application
parameter, r. This is the synchronous mode of algorithm operation. We also
provide for an asynchronous operation mode, so that significant spatio-temporal
changes in the content can be accommodated before the expiry of r. When such
changes occur, an exception is raised, the timer r is reset, and the algorithm runs.
The exception is triggered whenever the difference between the instantaneous
utility value and its moving average (Sect. 2.2) is bigger than a threshold value.
This threshold value may be determined by the application programmer, based
on subjective user evaluation. In further work, we intend to investigate methods
of efficiently determining the adaptation period according to quickly changing
network parameters (RTT) and the error in the accuracy of utility values.

1 Initial constraints may exist in cases where a minimum acceptable level of service
is required. This can be expressed by a minimum number of layers that need to be
transmitted for the corresponding streams.
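
The interplay of the two operation modes can be sketched as follows. This is a
minimal illustration under assumptions, not the authors' implementation: the class
and method names are hypothetical, a single utility stream is tracked, and the
deviation is measured against the moving average as it stood before the current
frame.

```python
class AllocationTrigger:
    """Decides when the inter-stream allocation should run."""

    def __init__(self, period, threshold, a=0.9):
        self.period = period        # allocation period r, in seconds
        self.threshold = threshold  # deviation that raises the exception
        self.a = a                  # averaging weight of the moving average
        self.avg = None
        self.next_run = period

    def on_frame(self, now, u_cur):
        """Feed one per-frame utility value; return True if allocation should run."""
        if self.avg is None:
            self.avg = u_cur
        deviation = abs(u_cur - self.avg)
        self.avg = self.a * self.avg + (1.0 - self.a) * u_cur
        # Asynchronous mode (content change) or synchronous mode (timer expiry).
        if deviation > self.threshold or now >= self.next_run:
            self.next_run = now + self.period  # the timer is reset in both cases
            return True
        return False
```

Switching off the exception mode, as in the experiments reported later, would
correspond to choosing a threshold above the largest possible deviation (for
utilities in [0, 1], any value of 1 or more).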



4 Experimental Results 

We ran simulations using ns-2 (Network Simulator, v.2) [?]. We used three
original video sequences with diverse amounts of scene activity, which accordingly
represent diverse requirements in terms of bandwidth consumption. The
first (Salesman) is a relatively low-motion, CIF size (352x288), 450-frame,
video-conferencing-like sequence; the second (Why) is a high-motion, QCIF size
(176x144), 1500-frame sequence extracted from a TV commercial. The third
(Advert) is also a high-motion, CIF size, 770-frame sequence from another TV
commercial. We derived four encoded streams from the original three, encoded
at up to six discrete bit-rates (layers) using an H.261 and an H.263+ encoder
developed at BT Labs [?], and a publicly available MPEG codec (footnote 2).
Table 1 summarises the features of the four chosen streams. In total, twenty-one
combinations of (encoder, bit-rate) sequences were produced. A pre-configured
preference weight was given to each flow, also shown in Table 1.

These four streams were assessed off-line using the metric described in Sect. 2.2.
For each encoding point we generated the corresponding utility curves (Fig. 3)
on a per-frame basis. We do not scale the utility values obtained to a meaningful
range (e.g., 0-5), as this does not benefit the execution of the algorithm. These
quality values are smoothed on a frame-by-frame basis, as explained in Sect. 2.2.



Table 1. Properties of video streams

Sequence   Size   Encoding   Cumulative bandwidth (Kbps)   Preference weight
Why        QCIF   H.263+     32, 64, 128, 256, 384, 512    0.25
Salesman   CIF    H.261      64, 128, 256, 384, 512        0.2
Salesman   CIF    MPEG       128, 256, 384, 512, 740       0.25
Advert     CIF    MPEG       128, 256, 384, 512, 740       0.3


The simulated network topology consisted of a media server concurrently
transmitting the four video flows to a single receiver over a network with a
single bottleneck link. The bottleneck link capacity was set to 1.5 Mbps. We ran
the simulation for 120 sec. We set the running average parameter a to 0.9 and the
allocation period r to 2 sec, and switched off the exception (asynchronous)
operation mode.

2 http://www.mpeg.org/MPEG/video.html##video-software

[Figure panels: (a) session bandwidth; (d) 60 background connections; (f) utility
gain over number of background connections]

Fig. 4. (a) simulated session bandwidth over time, (b)-(e) improvement in total session
utility (over time) for different levels of background traffic, and (f) mean utility gain
and standard deviation over number of background connections



Recent work shows evidence of self-similarity in Internet traffic [?]. To simulate
the background traffic at the bottleneck link, we subtracted from the bottleneck
link capacity the superposition of several ON/OFF UDP sources, whose ON/OFF
times are drawn from a Pareto distribution with a shape parameter set to 1.5. The
mean ON and OFF times were 1 sec and 2 sec respectively. During ON times the
sources transmitted packets of size 1500 bytes at the rate of 32 Kbps. The number
of concurrent background connections was varied between 10 and 80. Fig. 4(a)
shows the variation of the bandwidth available to the session over the simulation
time. During the simulation, all utility profiles were periodically sampled (on a
per-frame basis) to produce their moving average curves. Every r seconds, the
corresponding moving average values for each flow and encoding point were chosen,
which formed piecewise linear curves like the ones shown in Fig. 3 (right). The
algorithm was applied to these curves and the number of layers to be transmitted
for every flow was determined.
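
The background traffic model can be approximated outside ns-2 with a short Python
sketch. It is an illustration under assumptions rather than the simulation script
used here: the function names are hypothetical, every source is assumed to start in
the ON state at time 0, and packetisation is ignored. It relies on the fact that a
Pareto variable with shape k > 1 and scale x_m has mean k x_m / (k - 1).

```python
import random


def pareto_scale(mean, shape):
    """Scale x_m for which a Pareto(shape, x_m) variable has the given mean."""
    return mean * (shape - 1.0) / shape


def on_intervals(duration, mean_on=1.0, mean_off=2.0, shape=1.5, rng=None):
    """Return the (start, end) ON intervals of one source within [0, duration]."""
    rng = rng or random.Random()
    t, intervals = 0.0, []
    while t < duration:
        # random.paretovariate(shape) has scale 1, so rescale to the wanted mean.
        on = pareto_scale(mean_on, shape) * rng.paretovariate(shape)
        off = pareto_scale(mean_off, shape) * rng.paretovariate(shape)
        intervals.append((t, min(t + on, duration)))
        t += on + off
    return intervals


def available_bandwidth(t, sources, capacity_kbps=1500.0, rate_kbps=32.0):
    """Bottleneck capacity minus the rate of all sources that are ON at time t."""
    active = sum(any(s <= t < e for s, e in src) for src in sources)
    return max(capacity_kbps - active * rate_kbps, 0.0)
```

Calling `on_intervals(120.0)` once per background connection and sampling
`available_bandwidth` over time gives a bandwidth trace of the kind the session
would observe.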

Fig. 4(b)-(e) depict the improvement in total session utility compared to a
weight-proportional sharing of the session bandwidth. The error bars represent
the distance of the greedy solution from the optimal one, obtained by solving the
knapsack problem. As shown, the optimal solution coincides with the greedy
solution most of the time. We performed measurements to determine whether the
amount of background traffic affects the output of the algorithm. Fig. 4(f) depicts
the mean and standard deviation of the utility gain ($U_{total,greedy} / U_{total,proportional}$)
for different levels of background traffic; it shows that the benefit obtained from
a utility-based apportioning of the bandwidth is not degraded with decreasing
levels of bandwidth availability; rather, it remains almost constant or even
increases as the number of background connections increases.
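
For a handful of streams, the optimal reference solution and the gain ratio can be
computed by exhaustive search, as in the hedged sketch below. The exhaustive search
is a stand-in chosen for brevity; the text does not state which knapsack solver was
used, and all names and the data layout (`rates[i][c]`, `utils[i][c]` for c layers
of stream i) are assumptions.

```python
from itertools import product


def total_utility(weights, utils, alloc):
    """Weighted session utility of an allocation (layers per stream)."""
    return sum(w * u[c] for w, u, c in zip(weights, utils, alloc))


def optimal_allocate(weights, rates, utils, budget):
    """Try every layer combination and keep the best one that fits the budget."""
    best, best_u = None, -1.0
    for alloc in product(*(range(len(r)) for r in rates)):
        if sum(r[c] for r, c in zip(rates, alloc)) <= budget:
            u = total_utility(weights, utils, alloc)
            if u > best_u:
                best, best_u = list(alloc), u
    return best, best_u


def utility_gain(u_utility_based, u_proportional):
    """Ratio of the utility-based session utility to the proportional one."""
    return u_utility_based / u_proportional
```

The search space grows as the product of the layer counts, so this only suits small
sessions such as the four-stream one studied here.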

5 Applications 

The Internet multimedia communication model will be increasingly interactive
and more content-rich. For example, Internet TV coverage of a football event will
have, apart from the main field view, several other video feeds: zoom view, reverse
angle, player close-ups, replays, etc. Virtual classrooms, immersive shopping
worlds and tele-presence are also applications that involve multiple concurrent
and diverse flows of changing importance.

In applications that involve multiple flows, the component flows might be
compressed using different algorithms, might have different quality and frame
rate requirements, and consequently will have different quality profiles.
Application and service providers will wish to get the maximum benefit from their
network resources, in terms of total viewer satisfaction. With finite network
resources, not all flows can transmit at peak quality, and quality trade-offs should
be considered at the sender. However, this must be tempered in the light of
other considerations, such as server workload sharing, bandwidth reservations
and charging, and a resource sharing mechanism that maximises the perceived
quality should be enforced. This will improve the usability of the service for
both the user (who will get maximum benefit from their spending) and the
provider (who will be able to achieve efficient utilisation of resources, while
offering the maximum possible quality).

In order to determine a receiver’s preference weights, mechanisms
that imply preference might be used. In a sports game, for example, a viewer
who is watching the main field image will use a bigger portion of the screen to
display that video (e.g., SCIF), and this could be taken to mean that better quality
is desired. At the same time, several smaller (e.g., QCIF) images from other
video feeds could also be positioned on screen. When a user wants to switch the
main feed of interest, the currently lower-quality flow will become the main one,
the user’s preferences will be re-calculated, and eventually this will be reflected
in the way bandwidth (or layers) is allocated by the source. This implies the
existence of automated application cues or GUI-assisted mechanisms for the setup
(on-line or off-line) of the user preference profiles. In addition, the mechanism
requires that content servers be able to generate (on-line or off-line) quality
profiles for their codecs, content, etc. Advances in the generation of quality
profiles that include assessment results highly correlated to human perception
[?], [?] will strengthen the potential of mechanisms such as the one proposed here.

6 Conclusions

This paper is inspired by two trends in multimedia networking: first, the
evolution of more interactive and content-rich applications, made possible by
advances in networking and significant market interest; and second, the need for
rate policing of real-time traffic, together with the benefits of making collective
adaptation decisions over an ensemble of related flows that share common
congestion information. We have presented arguments that illustrate the importance
that the time-varying nature of multimedia content may have in an inter-stream
bandwidth allocation process, when efficient objective quality assessment models
are in place. For this purpose, we utilised a simple, SNR-based metric, and
described a mechanism for inter-stream session bandwidth sharing that improves
the total session quality.

Although a quality assessment model that features SNR-like measurement as
the quality index does not always provide reliable results, we did not intend to
describe an efficient objective quality assessment model, but rather to indicate
the usefulness of utility-based dynamic inter-media bandwidth sharing. More
effective real-time quality assessment models will soon be available with advances
in coding techniques and higher availability of processing power. Off-line quality
assessment can also be used in the case of pre-recorded, stored media to obtain
quality profiles. These quality indices can be dynamically consulted at
appropriate time scales (e.g., on a GOP-by-GOP basis) and adaptation can be
applied accordingly.

The simulation results obtained showed an improvement in comparison to a
scheme that shares the session bandwidth according to the preference weights
of the flows. We note that the mechanism presented is coarse-grained, since it
works at the granularity of layers. Future work will investigate how to further
enhance the model to cope with fine-grained adaptation by adjusting and varying
the transmission rates of the layers, to utilise any excess bandwidth below that
needed to allocate extra layers.

Acknowledgement. We would like to thank Orion Hodson, Graham Knight
and Antony Steed at UCL for helpful comments and suggestions on this work.

Abstract. This paper presents a novel approach to interaction control in large,
synchronous and loosely coupled but chairperson-controlled conferences. Based
on the IP Multicast protocol, extended by mechanisms allowing resource
reservation for prioritized flows, the proposed architecture supports interaction
control among conference members as well as focus control for each conference
participant.

Interaction control is performed by using a scalable signaling protocol: a
conference may be recursively split into sub-sessions, each of which provides
identical functionality independently of the others and establishes a single
session. A session is controlled by a chairperson who manually grants, revokes or
rejects an interaction request of a registered session member. The granted
interaction is announced to the session participants such that all participants may
have their focus automatically set to the session participant providing the audio
and video streams. The focus control thus provided enables conference participants
to individually manage their own audio and video stream perception, i.e. a
participant either decides individually to whom personal attention should be
granted or strictly follows the session’s focus granted by the chairperson.

The proposed conference control architecture provides the network-management
platform for large virtual classrooms for university-like lectures over high-speed
Internet connections.

Although we focus on applying the conference control architecture to
university-like lectures, it remains generally applicable to any audio- and
video-supported chairperson-controlled conference over the Internet.



1 Introduction 

The use of information and communication technology in teaching and learning has
initiated a transition to remotely sourced university lectures; [?] are a few
examples. In such a scenario, lecturers and students are separated in space, i.e.
students can attend lectures from almost anywhere and in particular from home. As
already noticed by others, it can be foreseen that this tendency will continue to
increase [?]. The reasons are manifold: professionals engaged in continuous
life-long learning to keep pace with the steady progress of research might be
interested in reducing traveling time when attending a course. Students may prefer
attending a lecture from home to extract the relevant information better.
Educational institutions might be interested in reducing costs by transferring the
ever-growing demand for university space to virtual classrooms.

Teaching and learning in a virtual classroom can be seen as a large
chairperson-controlled conference. It is a large conference since, in general, the
number of participants is large. It is chairperson-controlled because a student
who intends to ask a question signals her or his intention to the lecturer. The
lecturer grants or rejects attention to the asking student and thus acts as a
chairperson. In addition to asking questions, students should have the possibility
of chatting with each other. As in a traditional classroom setting, this
interaction should bypass the control of the lecturer. In order to create a feeling
of being in a single virtual classroom, we also require that interaction is
supported by exchanging audio and video streams [?].

In any large conference, a form of coordination is required in order to prevent
significant information from disappearing in the vast “background noise” created by
chatting conference members. One way to cope with this problem is to enable a
single speaker to provide information while all other conference participants may
either focus on the speaker or, alternatively, create smaller discussion groups.
For the coordination among sites we have designed an application-level protocol
that supports interaction control among conference members as well as focus control
for each conference participant. Similar to traditional lecture situations, a
centralized management directs the mainstream of the lecture while enabling and
disabling request-reply-like questioning. This paper focuses on the protocol
requirements and the specification of a conference control architecture to be used
in a synchronous distance education environment for university lectures. Moreover,
the protocol can handle sub-conferences between groups of participants. For
high-quality audio and video communication, the system foresees the use of
resource reservation. The proposed conference control protocol is generally
applicable in any conference-like situation which requires audio and video support
and is led by a chairperson.

1.1 Overview of This Paper 

In Section 2 we discuss assumptions and requirements of the assumed scenario. The
protocol implementation is presented in Section 3. A discussion of related
approaches is provided in Section [?]. Section [?] provides an overview of the
results and concludes the paper.

1.2 Terminology 

In this paper we define our terminology as follows: 

- Lecture participants are divided into lecturer and students. The term lecturer may
be applied to any conference-controlling chairperson.

- The local classroom denotes the place where the lecturer is located.

- Local students attend a lecture in the local classroom.

- Remote students participate in the lecture over the Internet.

- The term virtual class denotes the aggregation of lecturer, local and remote students.

- An abstraction of the location of the virtual class is named virtual classroom.

- Lecture-interaction denotes a dialogue between students and the lecturer that is
perceived by the virtual class.

- The lecturer controls lecture-interactions by using a lecture-controlling computer.

- A participating site denotes a computer which is connected to the virtual classroom.

1 In synchronous conferences, participants are separated only in space. The asynchronous model
further separates the participants in time.


2 Basic Assumptions and Requirements 

In this paper, we assume a distance education environment where students are in a
local classroom as well as at remote sites. The virtual classroom is established over
the Internet, to which all remote students are connected. Each participating site runs a
multimedia-capable computer with an attached camera, speakers and a microphone.
The computer is used for the following purposes: primarily, it receives audio and video
data from the virtual classroom. Received audio and video are decoded, decompressed
and presented to the corresponding end-devices. During interactions with the lecturer or
other participants, the student’s computer acts as an audio and video source. It
compresses, encodes and sends audio and video data to the virtual classroom.

In the near future, the bandwidth available to users at home will increase through
the use of broadband cable TV or any mode of the digital subscriber line technology
as the underlying link. Differentiated [?] and integrated service qualities are
about to be installed and configured in edge-border routers of Internet service
providers. High-quality audio and video distribution over the Internet will become
reality and, thus, will allow students to actively participate in virtual classes
from remote locations.

In traditional university-like lectures (see footnote 2), the lecturer is the
acting person. The audience is listening while viewing teaching aids (e.g. slides)
and looking at the lecturer. In a lecture, participants may require a temporary
change of the lecture-attention (floor control) by requesting the focus of the
lecture to ask a question. The other participants are aware of this interaction.

Separation in space should be bypassed in the virtual classroom so that the
differences between a traditional and a virtual class are minimized. A major
drawback of space-separated lecture participation is the inability to hold an
individual discussion with a desk-neighbor. We therefore introduce the concept of
an individual focus. Focus control allows a lecture participant to decide
personally to whom it is granted: to the lecture or to the virtual desk-neighbor
who wants to discuss an individual aspect (cf. Section [?]).

Today, distance education platforms lack the possibility of controlling and setting the
individual focus to the participant’s own points of interest. However, this should not
be done at the cost of the information flow of the lecture but in addition to it, i.e.
the personal focus can be set to a discussion or chat with other participants.

2 By the term university-like we mainly describe the behavior of students: they decide on their
own whether they want to attend a lecture, do not want to participate or temporarily leave the
classroom for “playing cards”. At universities, it is the freedom of thought and behavior that
creates this extraordinary spirit.

Summarizing, in a synchronous, interactive distance lecture environment the
following requirements are to be met:

- A virtual classroom must be established to fully integrate the remote participants
into the lecture.

- High-quality audio and video (a/v) data must be transmitted bi-directionally to
overcome the space-separation.

- Floor control and focus control according to a lecture-like policy must be integrated.

- Interactivity with the lecturer and interactivity between remote participants must be
supported.

- Scalability and flow prioritization are preferred over lossless transmission.

- Platform and application independence is required to keep pace with the
development and deployment of off-the-shelf products.

- A uniform management platform is needed that covers lecture participation and
individual audio- and video-supported discussions among participants.



3 Implementation 

Distance education systems in use today mainly cover the distribution of audio and
video data from a lecturer to remote students [?]. Interaction between student and
lecturer is performed by sending e-mails or by requesting lecture assistance via
textual chat features.

Some other distance education scenarios simply offer conference protocols [?]
based on the MBONE [?] tools. Others again [?] try to integrate audio and video
distribution and whiteboard applications in a single, hand-tailored application
and, therefore, miss the platform independence required to reach not only
experienced users but beginners as well. Even extensions to the applications or to
the transmission of data require major application changes.

Assuming large numbers of participating students, as in traditional lectures, the
transition to lectures in the virtual classroom comes with the possibility of a
major increase in the number of listening and actively participating students.
Thus, a centralized approach to lecture membership and floor control, as in [?],
may impose a major hurdle to scalability.

To summarize, currently available distance education platforms have the following 
drawbacks: 

- No audio and video interactivity, either between remote students and the lecturer
or among remote students.

- No distance education-tailored platform supporting the conceptual behaviors of
university-like lectures in a virtual classroom.

- No application and platform independence. 

- Restricted scalability. 

3 In contrast to tightly coupled ones (refer to [?]) where conference participation and information
access is strictly controlled, the loosely coupled session model allows broader and more open
joining semantics.

Recognizing the drawbacks mentioned above, we propose an architecture for
university-like lectures in a virtual classroom to manage scalable session control,
floor control and thus interaction control based on the concept of an individual focus.

3.1 Conceptual Overview 

Our control scheme strictly follows the procedures of students attending a university
lecture, i.e. the general focus of interest is set to the lecturer. A student who
intends to ask a question signals the intention and waits until the lecturer grants
the requested attention. Besides the “official” interaction with the lecturer,
students may whisper with their virtual desk-neighbors without disturbing the main
flow. Lecture-interactions as well as individual discussion groups of remote
students are performed in real-time using audio and video data streams.

The provided distance education platform is implemented in a strictly layered
approach. The coordination and selection of high-bandwidth data streams are separated
from input and output applications by providing a local gateway application, called
the focus-control tool (see Figure 1). Thereby, we can provide platform and
application independence. Any application on any platform that provides the required
interfaces with respect to networking infrastructure and data formats of the virtual
classroom can be installed on remote students’ systems.



[Figure 1 diagram; labels: User Interface Applications, Focused Multimedia Data,
Focus Control Management, Focus Control Tool, Network]

Fig. 1. Focus Control Tool and User Interface



3.2 Abstractions 

Our architecture is based on the following abstractions: 

- Session: A session denotes a communication group. 

- Session Holder: Every session is controlled by a session holder. 

- Sub-session: Sessions may be "forked" into sub-sessions.

- Parent Session: The term parent session denotes the origin of a sub-session.

- Floor: The floor of a session denotes the transmitted information.

- Floor Holder: The floor holder provides the source of information within a session.

- Participant: A participant is anyone who has joined a session.

- Registered Participant: Only registered participants may actively participate, i.e.
interact with the session.
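
The abstractions above can be pictured as a small data model. The following is a minimal sketch only; the class and method names (Participant, Session, join, register, fork, root) are chosen for illustration and are not taken from the described implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Participant:
    name: str
    registered: bool = False  # only registered participants may interact


@dataclass
class Session:
    holder: str                              # session holder (chairperson)
    floor_holder: str                        # current source of information
    parent: Optional["Session"] = None       # None marks the root session
    participants: Dict[str, Participant] = field(default_factory=dict)
    sub_sessions: List["Session"] = field(default_factory=list)

    def join(self, name: str) -> Participant:
        """Add a passive (unregistered) participant, or return the existing one."""
        return self.participants.setdefault(name, Participant(name))

    def register(self, name: str) -> None:
        """Mark a participant as registered so that interaction is allowed."""
        self.join(name).registered = True

    def fork(self, holder: str) -> "Session":
        """Create a sub-session whose parent is this session."""
        sub = Session(holder=holder, floor_holder=holder, parent=self)
        sub.register(holder)
        self.sub_sessions.append(sub)
        return sub

    def root(self) -> "Session":
        """Walk up the parent links to the root session (the lecture)."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node
```

In this sketch the root session is simply the session without a parent, which matches the later definition of the root session as the lecture itself.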

3.3 Communication Channels 

The proposed conference control architecture manages the flow of data through four
primary communication channels (see Figure 2):

- SAC: The session announcement channel provides the session description needed to join a
session; see Section 3.5.

- FDC: Audio and video data are received on the focused data channel.

- ICC: On the interaction control channel, floor control is coordinated bilaterally but
controlled by the chairperson of a session. While the session holder decides to whom
the floor is granted, the decision is announced to future floor holders in advance
on the interaction control channel.

- TAC: The teaching-aids channel carries teaching aids (e.g. slides) and
annotations to them.



[Figure 2 diagram; labels: Classroom, Remote Student With Granted Classroom-Focus,
Participating Remote Students]

Fig. 2. Channel Overview



Session Announcement Channel: If the scope of a newly created session addresses a
multicast group, it becomes an open session. Registered participants of the parent session
may join this sub-session. Using the session announcement channel (SAC), participants may
set their focus to the current floor holder, whose audio and video are sent.

Focused Data Channel: Using the focused data channel (FDC), audio and video data
are transmitted. An FDC exists per participating computer. It is dynamically set (cf.
Section 3.7) according to the session announcements received on the SAC (cf. Section
3.3).

Interaction Control Channel: The interaction control channel (ICC) is used to control the
dialogue between participants (cf. Sections 3.8 and 3.4). The ICC is initiated by a
participant who intends to interact with the session.



Teaching-aids Channel: The teaching-aids channel (TAC) exists only in the root session 
which is defined as the lecture itself (cf. Section EH. All participants of the virtual 
classroom continuously receive the teaching aids transmitted by this channel even if 
they participate in sub-sessions. The TAC remains valid during a teleteaching session. 

3.4 Session Establishment 

Conceptually, every session is established if a participant sends a session request to 
another participant (a state transition diagramme is shown in Figure 0- If the callee is 
willing to change his or her current focus and to establish a sub-session, a session grant
message is returned. Otherwise, the session request is rejected. In the established
sub-session, the callee becomes the session holder, i.e. the chairperson. The caller and the
session holder are marked as registered participants.

Both the session request message and the session grant message are manually
initiated. The caller retrieves the session participants register (SPR) via a web-based
interface and obtains the address of the callee by selecting the appropriate entry. Once
an entry has been selected, a session request message is automatically sent to the callee.

The callee now decides whether he wants to create a private, unicast session or an 
open, multicast one. In either case, the callee becomes the session holder. 

If a multicast session is established, the session holder sends a registration message
to the parent session. This message extends his or her entry in the parent SPR by the
address of the newly created SAC. With this additional information, the callee
becomes visible as a sub-session holder. The address of the appropriate SAC in the
SPR is used to join a sub-session (cf. Section 3.5).

Neither registered nor unregistered session participants provide continuous video
or audio streams to the floor. To deal with bandwidth constraints, only floor holders are
allowed to send their streams. As a result, session participants are not continuously aware
of other attendees. Registered participants are marked within the SPR of the session holder.

A unicast session does not require a registration of the session holder in the parent’s 
SPR; a unicast session in the virtual classroom represents the desk-neighbor whispering 
in a traditional lecture. 

The sequence of messages sent during a session establishment is visualized in Figure
3. In both cases, the initially established channel to send a session request message to
the session holder becomes the ICC during the interaction period (cf. Section 3.8).

At the beginning of the lecture, the lecturer and the lecture information itself are 
registered in the SPR of the web-page which belongs to the lecture. 
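
The establishment procedure described in this section can be summarized in a few lines of code. This is a sketch under assumptions: the names SubSession, ParentSPR and establish, as well as the representation of addresses as plain strings, are hypothetical and only illustrate the decision flow (reject, private unicast session, or open multicast session with registration in the parent SPR).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class SubSession:
    holder: str
    registered: Set[str]
    sac_address: Optional[str]  # None for a private unicast session


@dataclass
class ParentSPR:
    # participant name -> SAC address of the sub-session the participant holds
    entries: Dict[str, Optional[str]] = field(default_factory=dict)


def establish(caller: str, callee: str, accept: bool, spr: ParentSPR,
              sac_address: Optional[str] = None) -> Optional[SubSession]:
    """Handle one session request; return the new sub-session or None."""
    if not accept:
        return None  # session request rejected, the callee keeps its focus
    # Session grant: the callee becomes session holder, both are registered.
    sub = SubSession(holder=callee, registered={caller, callee},
                     sac_address=sac_address)
    if sac_address is not None:
        # Open multicast session: extend the holder's entry in the parent SPR
        # so that other participants can find and join the sub-session.
        spr.entries[callee] = sac_address
    return sub
```

A unicast session leaves the parent SPR untouched, which corresponds to the desk-neighbor whispering case; only the multicast case makes the callee visible as a sub-session holder.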
[Figure 3 diagram; transition labels: Session Accept / Session Grant,
Scope Set / Register Multicast, - / Create SAC]

Fig. 3. Message Sequence



3.5 Session Join 

Joining a session requires the address of the appropriate SAC. This information is
retrieved by traversing the tree of sub-sessions, starting at the root session's SPR or, if
a sub-session of the current session is to be joined, at the SPR of the current session. If
an SPR is contacted to obtain information about a sub-session, the requesting participant is
registered in the contacted SPR but not in the targeted sub-session, i.e. her or his
participation is logged in the SPR. Clicking on the web-based SPR representation transmits
the information needed to connect to the SAC of the targeted sub-session.

The information transmitted on the SAC describes the sub-session as follows:

- Master session’s SAC-address. 

- Master session’s TAC-address. 

- Parent session’s SAC-address. 

- Floor group-address. 

- Session holder. 

- Floor holder. 

- Validation checksum of the SAC-data. 

The information needed to connect to a SAC is used to perform a multicast-join operation on
the SAC of the sub-session and, if applicable, on the TAC of the lecture. The
information received on the SAC provides the address of the floor, i.e. the FDC.
The received data are used to set the focus of the participant to the current floor holder
(see Section 3.7). Joining a sub-session changes the focus of the participant. However,
teaching aids are always sent on the TAC.
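
To make the listed SAC fields concrete, the following sketch packs them into a fixed-size binary record and validates it on reception. The layout is an assumption made for illustration: all six fields are represented as IPv4 addresses, and CRC-32 stands in for the unspecified validation checksum. The names Announcement, encode and decode are hypothetical.

```python
import socket
import struct
import zlib
from typing import NamedTuple


class Announcement(NamedTuple):
    master_sac: str
    master_tac: str
    parent_sac: str
    floor_group: str
    session_holder: str
    floor_holder: str


_BODY = struct.Struct("!4s4s4s4s4s4s")   # six IPv4 addresses, network byte order
_CHECK = struct.Struct("!I")             # CRC-32 over the body (assumed checksum)


def encode(a: Announcement) -> bytes:
    """Serialize the announcement and append the checksum."""
    body = _BODY.pack(*(socket.inet_aton(addr) for addr in a))
    return body + _CHECK.pack(zlib.crc32(body))


def decode(packet: bytes) -> Announcement:
    """Validate length and checksum, then rebuild the announcement."""
    if len(packet) != _BODY.size + _CHECK.size:
        raise ValueError("unexpected announcement length")
    body, check = packet[:_BODY.size], packet[_BODY.size:]
    if _CHECK.unpack(check)[0] != zlib.crc32(body):
        raise ValueError("checksum mismatch")
    return Announcement(*(socket.inet_ntoa(raw) for raw in _BODY.unpack(body)))
```

A receiver would call decode on every packet read from the SAC and use the floor_group field as the address of the FDC to focus on.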
3.6 Session Close

A session close is either performed individually by the session participant or, for all 
members of the session, by the session holder. The protocol provides the information to 
reconnect to the parent session or return to the root session. 

If the session holder decides to close the session, this intent is signaled to the
session before the SAC is closed. This session close announcement is sent to allow
the participants to have their focus automatically set to the parent session or,
otherwise, to the root session.

If a participant intends to cancel his or her participation in a sub-session, he or she
returns to either the parent or the root session: the participant performs a multicast-leave
operation on the current FDC and SAC and reconnects to the appropriate SAC and FDC as
described above. If the participant was registered in the current session, the participation
is de-registered when the participant performs a controlled session close. Otherwise, if the
participation is canceled by a system crash or abnormal program termination and the
exiting participant was registered, the entry in the SPR is kept for the lifetime of the
session. Session requests and join procedures (see Sections 3.4 and 3.5) addressed to
non-existent registrants are handled correctly by the focus control tool: no new session is
established, and the participation is kept unchanged.
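
The choice of the session to return to after a close can be sketched as a walk over the session tree. The mapping parent_of and the function refocus_target are hypothetical names; the sketch assumes that each session knows its parent and that the root session has none.

```python
from typing import Dict, Optional


def refocus_target(parent_of: Dict[str, Optional[str]], session: str,
                   to_root: bool = False) -> str:
    """Return the session to focus on after leaving `session`."""
    parent = parent_of[session]
    if parent is None:
        return session           # the root session has nowhere to return to
    if not to_root:
        return parent
    while parent_of[parent] is not None:   # climb until the root is reached
        parent = parent_of[parent]
    return parent
```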



3.7 Focusing 

Focusing on the floor holder is performed by using the information provided on the SAC.
The requirements for audio and video transmission impose quality of service constraints
on bandwidth and delay. Thus, channel reservation at the receiver site is required, as the
individual focus concept relies on information channels. However, unwanted data may
still arrive at the participant's computer. Therefore, the Focus Control Tool (see Section
3.8) provides local filtering according to the currently set focus. The Resource Reservation
Protocol (RSVP IfHH ) is used to set up quality of service constraints in the edge-routers
(see footnote 4) at the participants' Internet service providers (ISPs). For scalability
reasons, it is assumed that the backbones interconnecting the ISPs provide enough bandwidth
and sufficiently small delays to fulfill the requirements for live video and audio
transmissions. Differentiated services a between ISPs will guarantee the required qualities
for the virtual classroom. It is further assumed that backbones establish RSVP tunnels for
signaling purposes. As soon as a participant joins a session or recognizes an announcement
change, the current focus is torn down by sending the appropriate RSVP-TEARDOWN message to
the routers; afterwards, an RSVP-RESERVE message is sent along the receiving path.
RSVP signaling is performed by the focus control tool (cf. Section 3.8) independently of
the user application outputting the received data. This gives the required application
independence in virtual classroom sessions.
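
The order of operations on a focus change (tear down the old reservation, reserve for the new floor, filter locally) can be sketched as follows. No real RSVP library is used here: the signal callback is a hypothetical stand-in that merely receives the message type and the floor address, and the class name Focus is likewise an assumption.

```python
from typing import Callable, Optional

Signal = Callable[[str, str], None]   # (message type, floor group address)


class Focus:
    def __init__(self, signal: Signal) -> None:
        self._signal = signal
        self.current: Optional[str] = None   # address of the focused floor (FDC)

    def set(self, floor_group: str) -> None:
        """React to a session join or to a changed SAC announcement."""
        if floor_group == self.current:
            return                            # focus unchanged, nothing to signal
        if self.current is not None:
            self._signal("TEARDOWN", self.current)   # release the old reservation
        self._signal("RESERVE", floor_group)         # reserve along the new path
        self.current = floor_group

    def accepts(self, packet_group: str) -> bool:
        """Local filter: pass only data addressed to the focused floor."""
        return packet_group == self.current
```

Because the signaling lives entirely in this component, the application that renders the audio and video never needs to know that a reservation protocol is involved.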



4 By the term edge-router we denote the routers at the outer border of the Internet, i.e. the
ISPs, and not the interior routers of backbone networks.

3.8 Interaction Control and Floor Control

Interaction control is performed by using the ICC. A participant who intends to acquire
the floor establishes a unicast ICC to the session holder. A session participant who is not
yet registered becomes registered by sending a floor request. If a new session must be
created, the channel established to send a session request to another participant becomes
the ICC between the session participants (see Section 3.4).

By the ICC, a floor request is sent to the session holder. The session holder manually 
decides whether he or she wants to grant or refuse the floor to the requester. Refusing a 
floor request results in the termination of the ICC (see Section lT51) . If the floor is granted, 
a notification, called floor grant message, is sent by the ICC to the requester. Besides the
notification sent by the ICC, the change of the floor holder is, of course, announced on 
the SAC. The interaction policy imitates the known protocols of traditional university 
lectures: if the session holder intends to explicitly ask a participant, he or she asks the 
question by audio and video to the session and expects the callee to establish an ICC 
and follow the procedure as described above. 

The ICC remains established until an interaction process is either closed by the session
holder or canceled by the participant. Closing the ICC denotes the termination of
the interaction. Using the ICC, request-reply-like dialogues are possible. It is further
assumed that several ICCs per session are kept alive during a discussion: the session holder
decides "on the fly" to whom he or she wants to grant the floor.

If the session holder intends to ask a participant explicitly, the session holder articu- 
lates the question on the floor and addresses the participant by voice - as done nowadays 
in traditional university-like lectures - and expects that the participant responds by a 
floor request message as described above. 
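
The floor control procedure described in this section can be sketched as a small state holder kept by the session holder. The class FloorControl and its method names are hypothetical; the manual decision of the session holder is modeled as the boolean argument grant.

```python
from typing import Optional, Set


class FloorControl:
    def __init__(self, session_holder: str) -> None:
        self.session_holder = session_holder
        self.floor_holder = session_holder
        self.registered: Set[str] = {session_holder}
        self.open_iccs: Set[str] = set()     # participants with an established ICC

    def request_floor(self, participant: str) -> None:
        """A floor request opens an ICC and registers the sender."""
        self.registered.add(participant)
        self.open_iccs.add(participant)

    def decide(self, participant: str, grant: bool) -> Optional[str]:
        """Manual decision of the session holder on a pending request."""
        if participant not in self.open_iccs:
            raise KeyError(f"no ICC established by {participant}")
        if not grant:
            self.open_iccs.discard(participant)   # refusal terminates the ICC
            return None
        self.floor_holder = participant           # also announced on the SAC
        return "floor grant"                      # notification sent on the ICC
```

Several ICCs may be open at the same time in this sketch, so the session holder can switch the floor between waiting participants without new requests.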



The Focus-Control Tool (FCT) provides the gateway functionality as described above
(see Figure 1). Session participation, floor and session requests, and the granting of both
are handled by the focus control tool. It provides a clearly defined interface to which
graphical display tools can be connected.

The FCT receives the SAC, FDC and TAC. After data streams have passed filtering and
validation tests, they are provided locally to the applications. This gateway functionality
provides the platform and application independence requested above. If the display tools
running under UNIX allow data reception, for example via UNIX domain sockets 0,
filtered data are provided locally through this interface. If, on the other hand, display
tools started under other operating systems only provide traditional IP interfaces JS|,
valid data are transparently provided on such interfaces. Applications that were originally
not multicast-capable may participate in the virtual classroom: the FCT transparently
performs the unicast-to-multicast gateway functionality and vice versa. By intercepting the
SAC data, the FCT can provide resource reservation signaling to applications
that are unaware of quality of service possibilities. The amount of transmitted audio and
video data is reduced, since the FCT sends them out to the floor only if the floor is granted
to that participant.
3.9 Teaching Aids and Annotations

The floor holder of the root session, the lecture, is allowed to provide annotations to 
the transmitted teaching aids. Since the floor focusing and filtering of the data streams
are provided by the FCT, the teaching aids display and modification tool can behave
as if it were a single-user application. Annotations to the teaching aids are added on
the lecture-controlling computer. Modified information is sent to the participants of the 
lecture. 



3.10 Media Scaling 

By the term media scaling we mean the transformation of input data to output data
according to the available output bandwidth. Expecting the functionality and widespread use of
active Internet nodes m to become real in the near future, we rely on the possibilities 
offered by node-plugins to perform the required media scaling so that a participating 
computer always receives the audio and video data in a best-possible quality. 



3.11 Failure Handling 

The proposed protocol follows the commonly available approach for reservation-failure 
handling in an Internet environment: If path-reservations fail, best-effort transmission is 
used per flow. Failure scenarios are identified for lost RSVP reservation messages and
for the inability of routers on the path to provide the required resources. If a
user-lecturer interaction could not be established, it is up to the student to signal
the interaction intent.
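
The fallback rule for failed reservations can be written down in a few lines. In this sketch the reserve callable is a hypothetical stand-in for the reservation attempt: it returns False when a router cannot provide the resources, and a TimeoutError is assumed to indicate a lost reservation message.

```python
from typing import Callable


def transmission_mode(reserve: Callable[[], bool]) -> str:
    """Try a path reservation and fall back to best effort if it fails."""
    try:
        granted = reserve()      # False: a router cannot provide the resources
    except TimeoutError:         # assumed signal for a lost reservation message
        granted = False
    return "reserved" if granted else "best-effort"
```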

Currently, no service redundancy to provide failure recovery is foreseen in case of 
session controlling computer crashes. In a future generation of the currently provided 
virtual classroom description, failure recovery procedures must be integrated. A possible 
solution to provide the fault-tolerance might be the mirroring of the lecture-controlling 
computer. An in-depth analysis of this topic is subject of further studies. 



4 Related Work 

Despite the vast literature available on the topic of distance education, only a few
publications address the underlying signaling protocol. Therefore, we concentrate
on the ones that had the most influence on the design of this protocol suite: the
signaling protocols of the MBONE tools, the protocol concept of digital lecture
boards (DLB), the ITU-T T.120 recommendation, and the MACS environment.



4.1 Session Description Protocol (SDP), Session Initiation Protocol (SIP), Session 
Announcement Protocol (SAP), Questionboard (QB) 

The protocols SAP, SIP and SDP fT3l. E2 1 . lEBl are primarily designed to convey the 
information required to support the MBONE O tools. They are, spoken in general 
terms, the basics required to establish the session directory SD El and to be used 
by vic E3, vat EM and wb 8271 . By the use of the MBONE tools, loosely coupled
conferences are created. 

SAP provides the protocol to announce different sessions to be registered in the SD. It
describes the information layout of the announced session. The principal purpose of SAP
is the transmission of the SDP and SIP data. By SDP, the session is described in the SD.
It provides the description of session participants to the SD, which limits
scalability because of an explosion of messages. SIP, on the other hand, describes a method
to invite computer users to participate in an MBONE session. The transmitted data provide
the information needed to start the required applications automatically, if configured
correctly.

The idea how to join a session in our architecture stems from (T9i l. The definition 
of the transmitted information on the SAC was influenced by SDP. We focused on 
the computer processing-performance aspects and therefore reduced the SAC-data to 
a binary and thus easier to process form while the MBONE signaling protocols are 
character-based. While SIP provides the invitation to join a session, we changed the
direction and provide a join mechanism oriented more toward university lectures. The
concept of the SPR was influenced by the SD. We provide a web-based static form of the
information needed to join a session on the computer of the parent session's holder.

Floor control as provided in the herewith presented interaction control architecture 
is similar to the concepts evolved by the MBONE questionboard (qb) m. While qb 
provides moderated sessions where the floor is granted by the moderator as well, its
applicability is limited for the following reasons:

- Qb is designed for the MBONE. It therefore relies on the conference bus ll'bll as
designed for the MBONE tools. For proper operation, the MBONE tools are required.

- Qb does not allow the explicit revocation of the granted floor by the moderator. 

- Hierarchical (sub-)session management is not provided. 

Nevertheless, one aspect that qb provides is not covered by our approach: recovery
mechanisms based on multicasting "hello"-like host-startup messages.

The loosely coupled conference-join semantics of our conference control architecture
are similar to those of the MBONE tools (IP multicast), while our centralized coordination
reflects the original need for this protocol: the distance education environment.



4.2 The Upper Rhine Virtual University (VIROR) 

In VIROR I'itSil the universities of Mannheim, Heidelberg, Freiburg and Karlsruhe in 
Germany developed a tele-teaching application platform. They rely on the MBONE tools
to handle the protocol-oriented communication and provide a tool-wrapping
application. A clear drawback of their approach is the dependency on the
underlying operating system platform, since the management flow is not separated from
the user-interfacing information flow.

In VIROR, which evolved from the project Tele-Teaching Uni Mannheim-Heidelberg [5],
floor control is implemented in a manner similar to our protocol, but it requires
synchronized finite state machines on each participating computer. It further differs in
the degree of interactivity between lecture participants; private chat-sessions between 
virtual desk-neighbors are not provided (see Section IT%I . 
4.3 ITU Recommendation T.120 - Data Protocols for Multimedia Conferencing

The ITU-T T.120 family of protocols [12] is designed to be used as a basis for multimedia
conferencing over public switched telephone networks. Similar to our approach,
the T.120 family provides a centralized moderating functionality. In contrast to
our interaction control architecture, the T.120 family of protocols requires reliable
communication. Multicasting is not intended. As a result, this centralized approach
provides explicit, server-controlled connections per remote location. A specialized
Multipoint Control Unit is required to provide the virtual interconnections among the
single multimedia terminals. Thus, dynamic scalability is limited.

4.4 MACS - Modular Advanced Collaboration System 

The approach taken by the project MACS II I .111 provides a similar concept as the one 
provided by our architecture. A floor and session control library performs the appropriate
actions. This library provides an interface that allows application programmers to adapt
their code. Portability is achieved by using Java as the library programming language.
Currently, no hierarchical session management exists.

While MACS provides a similar concept of establishing sessions, scalability limitations
could arise from the tight management restrictions imposed by the registration
mechanism. Hierarchical session management covered by a single user interface is not
foreseen. In contrast to our architecture (cf. Section 3.7), no integration of resource
reservation protocols or differentiated service mechanisms is foreseen. The portability
provided by the Java-based library requires the involved applications to be adapted, while
our approach using the gateway functionality (cf. Section 3.8) provides platform and
application independence.

5 Conclusions 

In this paper we have proposed a new virtual classroom architecture. The architecture
provides interaction and floor control protocols to coordinate large audio and video
conferences over high-speed Internet connections. Interactivity rules imitating university
lectures provide a scalable coordination platform for virtual classes. Scalability is
achieved by distributing coordination competence to session holders and by minimizing
registration overhead. The approach of using native IP multicast over the Internet
supports a large number of receiving conference participants.

High-quality audio and video streams that enable participants to perceive everyone's
interaction with the session holder are managed by using RSVP on the outer-border routers
and differentiated services in the backbones. Audio-visually supported discussions
between the lecturer and remote students as well as those among students establish a
virtual classroom. Due to the loosely coupled organization of multimedia sessions and
the proposed use of RSVP only on the edge routers of the Internet, the provided
architecture is expected to scale. Further measurements of the deployed
architecture within larger trials are required to prove the scalability.

Management is done by providing a control tool which acts as a gateway. Thus, a
platform-independent control and management tool coordinates the reception of data
flows from the virtual classroom as the session per se, while off-the-shelf products can
be used for multimedia input/output purposes. Altogether, our proposed coordination
architecture and protocol cover the requirements put forward in Section 2 and deal with
the problems mentioned in Sectional 

In this paper, aspects of security and membership control and effective memberships 
within a lecture (who is allowed to actively or passively participate) are not covered. The 
problem of multicast-address allocation is out of the scope of this paper. The reader is 
referred to m for further discussions. 

The proposed platform will provide the network resource control to the project Easy 
Teach and Learn*^ Q of the Communication Systems Group at Swiss Federal Institute 
of Technology. 

Acknowledgments. We would like to thank the Swiss Federal
Institute of Technology (ETH Zurich) and the Hasler Stiftung for funding the project
Easy Teach and Learn.
Abstract. IP Telephony has recently attracted a lot of attention and will be used
in IP-based networks and in combination with the existing conventional
telephone system. There is a multitude of competing signaling protocol
standards, interfaces and implementation approaches. A number of basic
functions can be found throughout all of those, though. These include
the addressing of participants using symbolic names, the negotiation of
connections and their parameters, as well as the enforcement of a dedicated
handling of data streams by means of QoS signaling activities. Thus,
a generic abstraction hiding underlying protocol specifics is very desirable
and useful. The Delivery Multimedia Integration Framework DMIF -
as part of the MPEG approach towards distributed multimedia systems
- forms a general and comprehensive framework that is applicable to a
wide variety of multimedia scenarios.

In this paper we describe a more generalized and abstract view of basic
IP Telephony signaling functions and show how these can be hidden below
a common DMIF interface. This will allow for the implementation of
interoperable applications and a concentration on communication functionality
rather than protocol details. We expect that this will also allow
for better exchangeability, interoperability and deployability of emerging
signaling extensions.

Keywords: Internet Telephony, Signaling, SIP, H.323, MPEG-4, DMIF 



1 Introduction 

IP-Telephony applications are considered to have a huge economic potential in
the near future. As companies and service providers begin to consider the technology
ready for carrier-grade usage, it may also speed up the deployment of
state-of-the-art QoS, security and billing components in local as well as in
backbone networks. Though IP-Telephony might be seen as (just) a specific
application today, it is part of an ever-growing scene of more general multimedia
applications. Considering the high dynamics and the multitude of concurrent
approaches in signaling protocols, interfaces and implementations, a consistent and
comprehensive framework can speed up development and allows a faster, better
and more generic implementation as well as the interoperability, exchangeability
and re-use of modular components.



H. Scholten and M. van Sinderen (Eds.): IDMS 2000, LNCS 1905, pp. 104-^^^ 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000
2 IP-Telephony Signaling Protocols and APIs

IP-Telephony signaling protocols are used to establish a conversation compara- 
ble to a classic telephone call using an IP infrastructure. Typical applications 
and scenarios are recently based on different protocol suites. Mainly, there are 
two major approaches - the H.323 00 protocol family and the Session Initia- 
tion Protocol SIP ^ with a changing distribution and relevance. Though today, 
a high percentage of applications and scenarios is still H.323 based (and we will 
therefore initially focus on it), it is supposed that in the near future the use of 
the SIP protocol may increase mm- Both protocol types will even be usable 
together with appropriate gateways [ ? ]P3- Additionally, the close interaction 
with the existing Public Switched Telephone Network (PSTN) on the basis of 
interacting media gateways plays a very important role. For that domain, proto- 
cols like MGCP 0 / H.248 0 that describe the interaction towards the PSTN 
SS#7 |12| and may use both H.323 and SIP for signaling within the IP telephony 
world are under recent development and standardization. 

2.1 Signaling Using the H.323 Protocol Family 

The H.323 protocol suite comprises a variety of communication relationships,
which are handled via dynamically negotiated channels for a number of H.323
protocol components such as RAS, Q.931 and H.245. These use Protocol Data Units
that are encoded as described in ASN.1 specifications. Though the H.323 protocol
suite has proven to provide the intended communication services, especially
for usage in LANs, it is considered to be complex, not easy to extend, and to
have a considerable signaling overhead that cannot be neglected in a global
environment.

2.2 Signaling Using the Session Initiation Protocol SIP 

The Session Initiation Protocol SIP was initially used as a protocol for
multicast applications and provides generic control functionality. Its basic
operations, which are directly related to call setup, are the registration of participants
and the redirection or proxying of control data traffic. This allows access to
telephony services through single points of contact, may hide infrastructure aspects
and is also applicable for building hierarchies.

In addition to its primary function, SIP allows call processing
and additional services to be controlled in a very generic, efficient and extensible
manner, e.g. by means of Call Processing Language (CPL) scripts. SIP protocol
functionality can be provided either by centralized components or at "smart" end system
nodes. Over the last period of intensive work, SIP has evolved into the core
protocol of a comprehensive framework, addressing additional features such as
QoS support, firewall interaction and call routing as well.

2.3 Application Interfaces - TAPI and JTAPI 

The Microsoft Telephony API (TAPI) JH] has been developed in a joint effort
by Microsoft and Intel and is provided as part of Win9x and WinNT. Its goals
are to isolate features of the underlying hardware from the applications by means
of a standard API and to specify a Telephony Service Provider Interface
that the underlying services have to meet.

It supports basic Computer Telephony Integration (CTI) applications like
automated dialing but, starting with TAPI version 3.0 under Windows 2000,
also includes sophisticated features such as IP multicast conferencing, an H.323 stack
and Interactive Voice Response (IVR) functionality. It is inherently limited
to the Windows platform, though, and does not cover SIP so far, although
protocol descriptions state that it provides powerful means to incorporate new
protocols or protocol extensions as so-called Third Party Service Providers.

The Java Telephony Application Programming Interface (JTAPI) [15] is an
object-oriented interface that allows the development of portable telephony
applications in Java. It uses a modular approach that places additional functionality,
so-called extensions, on top of a common JTAPI core. The API itself just
describes interfaces which have to be implemented for the underlying hardware
or protocol infrastructure. As a current drawback, it must be stated that there
is still only a small, though rising, number of JTAPI peer class implementations.

Both APIs have current limitations and are inherently targeted at telephony 
services. We do not consider JTAPI or TAPI as comprehensive alternatives or 
competitors to our approach - they can even be combined with it or provide 
services. 



3 The Abstraction Framework 
3.1 MPEG and DMIF 

MPEG-4 [10] is a new multimedia standard that is much more powerful and
comprehensive than the previous MPEG standards. To begin with, it provides 
an object-based description of content, which can be both naturally captured 
and computer generated. Though the term MPEG-4 is often associated with the 
specification of a set of video codecs working with individual visual objects, the 
standard is much more comprehensive. 

Among others, MPEG-4 defines the Delivery Multimedia Integration Frame- 
work (DMIF) 0j@. DMIF is a framework that abstracts and thereby encapsu- 
lates the delivery mechanisms from the applications. 

The framework's API, called the DMIF Application Interface (DAI), works with 
Uniform Resource Locators (URLs), which specify appropriate delivery 
mechanisms for specific scenarios. URLs can also specify the required network 
protocol, which provides a protocol abstraction for the applications. Additional 
parameters of a connection, such as Quality of Service (QoS) requirements, can 
be passed as arguments through this generic interface as well. The DAI is 
language and platform independent. A basic description of its primitives is 
given in Table 1. 

Additionally, DMIF defines an informative DMIF-Network Interface (DNI) 
for the network-related scenarios. DNI allows the convenient development of 
components that can easily adapt their signaling mapping to different protocols. 

Table 1. DMIF primitives and functionality 

Primitive            Description 
-------------------  ------------------------------------------------------ 
DAI_ServiceAttach    Allows the initialization of a service session with a 
                     remote peer, specified with a URL. 
DAI_ServiceDetach    Allows the termination of a service session. 
DAI_AddChannel       Allows the establishment of end-to-end transport 
                     channels in the context of a particular service session. 
DAI_RemoveChannel    Allows the removal of existing transport channels. 
DAI_UserCommand      Allows the application-to-application exchange of 
                     messages. 
DAI_SendData         Allows the transmission of media in the established 
                     channels. 
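
To make the role of these primitives more concrete, the following minimal sketch renders them as an abstract Python interface with a toy implementation that merely records the calls. It is an illustration only, not part of the DMIF specification: the method names, the integer session and channel identifiers, the `RecordingDAI` class, and the example URL are all assumptions.

```python
from abc import ABC, abstractmethod


class DAI(ABC):
    """Hypothetical Python rendering of the primitives in Table 1."""

    @abstractmethod
    def service_attach(self, url):
        """DAI_ServiceAttach: start a service session with the peer named by url."""

    @abstractmethod
    def service_detach(self, session):
        """DAI_ServiceDetach: terminate a service session."""

    @abstractmethod
    def add_channel(self, session, qos=None):
        """DAI_AddChannel: open a transport channel within a session."""

    @abstractmethod
    def remove_channel(self, session, channel):
        """DAI_RemoveChannel: remove an existing transport channel."""

    @abstractmethod
    def user_command(self, session, message):
        """DAI_UserCommand: exchange an application-to-application message."""

    @abstractmethod
    def send_data(self, channel, payload):
        """DAI_SendData: transmit media on an established channel."""


class RecordingDAI(DAI):
    """Toy implementation that only records the primitives invoked on it."""

    def __init__(self):
        self.log = []
        self._ids = 0

    def _next_id(self):
        self._ids += 1
        return self._ids

    def service_attach(self, url):
        self.log.append(("DAI_ServiceAttach", url))
        return self._next_id()  # assumed: an integer session identifier

    def service_detach(self, session):
        self.log.append(("DAI_ServiceDetach", session))

    def add_channel(self, session, qos=None):
        self.log.append(("DAI_AddChannel", session, qos))
        return self._next_id()  # assumed: an integer channel identifier

    def remove_channel(self, session, channel):
        self.log.append(("DAI_RemoveChannel", session, channel))

    def user_command(self, session, message):
        self.log.append(("DAI_UserCommand", session, message))

    def send_data(self, channel, payload):
        self.log.append(("DAI_SendData", channel, len(payload)))


# Typical order of use: attach, add a channel, send media, tear down.
dai = RecordingDAI()
session = dai.service_attach("sip:alice@example.org")  # illustrative URL
channel = dai.add_channel(session, qos={"media": "audio"})
dai.send_data(channel, b"\x00" * 160)
dai.remove_channel(session, channel)
dai.service_detach(session)
```

The point of the sketch is that an application written against such an interface never names a signaling protocol; only the URL passed to the attach primitive does.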



3.2 Framework Architecture 

Before defining the basic objects of our architecture, it is important to identify 
the main use cases that are typically required by IP Telephony applications. 
Basically, a user should be able to register himself in the "IP-networked world" 
so that other participants can locate and identify him. After that, it is 
possible to receive calls or to originate them to remote users. So, in a 
necessarily limited scenario, there are three main use cases, which are shown 
in Figure 1 using a UML use case diagram. 

Fig. 1. Basic Use Cases 

As an example for an additional service, we refer to the signaling for ensuring 
the desired QoS. 

From the analysis of the use cases we derive the architecture of the proposed 
framework. It is shown in Figure 2. 

Fig. 2. Framework Architecture 

In our framework the DMIF Application Interface is implemented with two 
interfaces, DMIFSession and DMIFApplication. The first provides the set of 
methods that are offered to the application by the DMIF layer, while the 
latter is the set of callback functions with which the DMIF layer informs the 
application about events and messages. The DMIF layer is provided to the 
applications through the DMIFFilter. Its responsibility is to parse application 
requests and to activate the appropriate DMIFInstances to handle them. 

The DMIFInstances are of two different kinds: the CallReceiver 
and the CallMaker. A CallReceiver object initially allows the user to register 
himself. It is then responsible for accepting and handling incoming calls. 
A CallMaker executes the requests for outgoing connections. 
The behavior of both CallReceiver and CallMaker is independent of the signaling 
protocol used. They communicate with the appropriate signaling object through 
the DNI. Specific implementations of signaling protocols are the SIPComm 
object, which uses SIP, and the H323Comm object, which uses the 
H.323 protocol suite. 
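
The dispatching role of the DMIFFilter can be sketched as follows. This is a minimal illustration under stated assumptions, not our implementation: the URL scheme names `sip` and `h323`, the `register` flag used to choose between a CallReceiver and a CallMaker, and all method signatures are hypothetical.

```python
class SIPComm:
    """Stand-in for the signaling object that speaks SIP."""
    protocol = "SIP"


class H323Comm:
    """Stand-in for the signaling object that speaks the H.323 suite."""
    protocol = "H.323"


# Assumed mapping from URL scheme to signaling object.
SIGNALING_OBJECTS = {"sip": SIPComm, "h323": H323Comm}


class DMIFInstance:
    """Common base: protocol-independent logic bound to one signaling object."""

    def __init__(self, signaling, address):
        self.signaling = signaling
        self.address = address


class CallReceiver(DMIFInstance):
    """Registers the user and handles incoming calls."""


class CallMaker(DMIFInstance):
    """Executes requests for outgoing connections."""


class DMIFFilter:
    """Parses application requests and activates a suitable DMIFInstance."""

    def service_attach(self, url, register=False):
        # Split "scheme:address"; everything after the first colon is the address.
        scheme, _, address = url.partition(":")
        try:
            signaling = SIGNALING_OBJECTS[scheme.lower()]()
        except KeyError:
            raise ValueError(f"no signaling object for scheme {scheme!r}") from None
        instance_type = CallReceiver if register else CallMaker
        return instance_type(signaling, address)


dmif = DMIFFilter()
outgoing = dmif.service_attach("sip:bob@example.org")
incoming = dmif.service_attach("h323:gk.example.org", register=True)
print(type(outgoing).__name__, outgoing.signaling.protocol)
print(type(incoming).__name__, incoming.signaling.protocol)
```

Note that CallReceiver and CallMaker contain no protocol-specific code in this sketch; the only protocol dependency is the lookup table consulted by the filter.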

4 Usage Scenarios and Protocol Mappings 

Having identified the basic functions, we now show how calls can be received 
or originated using either H.323 or SIP as the underlying signaling protocol. 
The protocol details are hidden from the applications, which use the same 
interface and primitives in all cases. Using the appropriate URL identifier, 
they just have to specify which of the available protocols should be used. 

4.1 Registering and Receiving Calls Using H.323 



In this scenario, the IP Telephony Application (IpTelAppl) uses the H.323 
protocol to register the participant with a Gatekeeper and enables him to accept 
calls. The sequence of protocol steps is shown in Figure 3. 



[Sequence diagram; participants: IpTelAppl, DMIFFilter, CallReceiver, 
Client : H323Comm, Gatekeeper : H323Comm, H.323 Terminal : H323Comm] 

Fig. 3. Registering and receiving calls - H.323 scenario 

It should be noted that the DAI is used between the IP Telephony application 
and the DMIFFilter, and the DNI between the CallReceiver and the H323Comm 
object. The IpTelAppl attaches to the local Gatekeeper using the 
DAI_ServiceAttachReq primitive, passing the Gatekeeper's address if it is 
known. The DMIFFilter parses the passed URL and activates a CallReceiver 
object to handle the details of the operation. If the address of the Gatekeeper 
is not already specified, the CallReceiver may ask the local H323Comm object 
for its location using the DNI_SessionSetup primitive. In this case the 
local H323Comm object broadcasts a request to locate the Gatekeeper. After the 
Gatekeeper has been identified, the CallReceiver object requests the H323Comm 
object to register itself at the local Gatekeeper. The local H323Comm object 
completes the request by exchanging one more pair of messages with the 
Gatekeeper (RRQ and RCF). After the successful completion of the registration 
task, a handle to the CallReceiver is returned to the IpTelAppl for further use. 

Suppose that later another H.323 terminal wants to call the IpTelAppl. It 
can obtain the address of the local H323Comm object from the Gatekeeper. 
Then, it submits a Setup message to the local H323Comm object to request the 
establishment of a new session. The IpTelAppl is informed about this request 
by the CallReceiver object. If the IpTelAppl instance decides to accept the 
new call, the local H323Comm object requests admission from the local 
Gatekeeper (ARQ and ACF messages). After that it replies to the remote H.323 
terminal that it accepts the connection (CONNECT message). 

A number of H.245 messages follow in order to exchange the capabilities of 
the two terminals. Then, an Open Logical Channel (OLC) message is sent to 
request a media channel. The CallReceiver indicates this to the IpTelAppl 
(DAI_AddChannelInd), which confirms it. Finally, the local H323Comm object 
sends an acknowledgement to the remote H.323 terminal. The last procedure may 
be repeated for more media channels. 
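
The ordering of the steps described above can be summarized in a short sketch that simply lists them. It performs no networking; the function names and the textual step format are assumptions made for illustration, and only the messages named in the text appear.

```python
def registration_steps(gatekeeper_address=None):
    """Ordered steps for registering with a Gatekeeper (Section 4.1)."""
    steps = [
        "IpTelAppl -> DMIFFilter: DAI_ServiceAttachReq",
        "DMIFFilter: parse URL, activate CallReceiver",
    ]
    if gatekeeper_address is None:
        # Gatekeeper unknown: it has to be located first.
        steps += [
            "CallReceiver -> H323Comm: DNI_SessionSetup",
            "H323Comm: broadcast request to locate the Gatekeeper",
        ]
    steps += [
        "CallReceiver -> H323Comm: register at the Gatekeeper",
        "H323Comm -> Gatekeeper: RRQ",
        "Gatekeeper -> H323Comm: RCF",
        "DMIFFilter -> IpTelAppl: handle to the CallReceiver",
    ]
    return steps


def incoming_call_steps(media_channels=1):
    """Ordered steps for an accepted incoming call (Section 4.1)."""
    steps = [
        "H.323 Terminal -> H323Comm: Setup",
        "CallReceiver -> IpTelAppl: new call request",
        "H323Comm -> Gatekeeper: ARQ",
        "Gatekeeper -> H323Comm: ACF",
        "H323Comm -> H.323 Terminal: CONNECT",
        "H.245 capability exchange",
    ]
    for _ in range(media_channels):
        # One OLC round per media channel.
        steps += [
            "H.323 Terminal -> H323Comm: OLC",
            "CallReceiver -> IpTelAppl: DAI_AddChannelInd",
            "IpTelAppl -> CallReceiver: confirmation",
            "H323Comm -> H.323 Terminal: OLC acknowledgement",
        ]
    return steps


for line in registration_steps():
    print(line)
```

Passing a known Gatekeeper address to `registration_steps` drops the two discovery steps, mirroring the optional discovery described in the text.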



4.2 Registering and Receiving Calls Using SIP 

Figure 4 shows the registration of an IpTelAppl in order to receive calls in the 
SIP case. The IpTelAppl requests user registration with the DAI_ServiceAttach 
primitive. The DMIFFilter therefore creates a new DMIFInstance, a CallReceiver, 
to handle the registration and possible future calls. The CallReceiver 
asks the SIPComm object to establish a connection with the SIP location 
server. The SIPComm object sends a REGISTER message to the Location 
Server to store its location information for future incoming calls. The handle 
returned to the IpTelAppl is used for future interactions. 

Later, when a new INVITE message is received by the SIPComm object 
(Client instance), the IpTelAppl is informed. It can then either confirm 
the acceptance of the incoming call or reject the new invitation. In the case 
of acceptance, a SIP 200 OK response is sent to the caller, and the call is 
established after the final ACK is received. 
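
The handling of an incoming SIP call described above amounts to a small state machine, sketched below. The class, the state names, and the method names are hypothetical; the response sent on rejection is deliberately left out because the text does not specify it.

```python
from enum import Enum, auto


class CallState(Enum):
    IDLE = auto()
    INVITED = auto()      # INVITE received, application has not decided yet
    ACCEPTED = auto()     # 200 OK sent, waiting for the final ACK
    ESTABLISHED = auto()
    REJECTED = auto()


class IncomingSipCall:
    """Tracks one incoming call as seen by the CallReceiver."""

    def __init__(self):
        self.state = CallState.IDLE
        self.sent = []  # responses handed to the SIPComm object

    def _require(self, expected):
        if self.state is not expected:
            raise RuntimeError(f"unexpected event in state {self.state.name}")

    def on_invite(self):
        self._require(CallState.IDLE)
        self.state = CallState.INVITED

    def decide(self, accept):
        """Called once the IpTelAppl has confirmed or rejected the invitation."""
        self._require(CallState.INVITED)
        if accept:
            self.sent.append("200 OK")
            self.state = CallState.ACCEPTED
        else:
            self.state = CallState.REJECTED

    def on_ack(self):
        self._require(CallState.ACCEPTED)
        self.state = CallState.ESTABLISHED


call = IncomingSipCall()
call.on_invite()
call.decide(accept=True)
call.on_ack()
print(call.state.name, call.sent)
```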



[Sequence diagram; participants: IpTelAppl, DMIFFilter, CallReceiver, 
Client : SIPComm, Location Server : SIPComm, Proxy or Redirect : SIPComm; 
legible messages: DAI_ServiceAttachReq, newCallReceiver, DAI_ServiceAttachInd, 
DAI_ServiceAttachCnf, DAI_AddChannelInd, DAI_AddChannelCnf, Notify, INVITE, 
200 (OK)] 

Fig. 4. Registering and receiving calls - SIP scenario 

4.3 Call Setup Using H.323 

In this scenario, shown in Figure 5, the IpTelAppl wants to set up a call to 
a remote H.323 terminal. Only the most important H.323 messages, those relevant 
to the DMIF layer, are shown. 

The IpTelAppl calls the DAI_ServiceAttachReq primitive to originate a new call. 
It passes the URL of the remote participant to address the intended 
communication partner symbolically. The DMIFFilter parses the request and 
creates a new CallMaker object to handle the details of this operation. It 
requests the session setup from the local H323Comm object, which communicates 
with the Gatekeeper to request admission to place the call and to address the 
remote party (ARQ and ACF messages). Then, the CallMaker calls the 
DNI_SessionAttach primitive to request the establishment of a new session with 
the remote terminal from the local H323Comm object. A set of messages is 
exchanged between the two H.323 peers, comprising both Q.931 and H.245 protocol 
elements. At the end, the IpTelAppl receives a positive response. 

Once the connection is established, the application may initiate the setup of 
additional media channels with the DAI_AddChannelReq primitive. The local 
H323Comm object negotiates the channels with the remote H.323 terminal 
(OLC and OLC Ack messages). This procedure is repeated between the two 
terminals for every media channel. 

[Sequence diagram; participants: DMIFFilter, CallMaker, Client : H323Comm, 
Gatekeeper : H323Comm, H.323 Terminal : H323Comm; legible messages: 
DAI_ServiceAttachReq, newCallMaker, DNI_SessionSetup, DAI_ServiceAttachRsp, 
DAI_AddChannel, DAI_AddChannelRsp, DNI_SessionAttach, addChannel, 
DNI_AddChannel, CALL PROCEEDING, H.245 Protocol Messages] 

Fig. 5. Originating Calls - H.323 scenario 

4.4 Call Setup Using SIP 

In this scenario, depicted in Figure 6, the IP Telephony application sets 
up a new call using the SIP protocol. The application needs no specific 
details of the protocol. 

It constructs an appropriate URL, which denotes the intended usage of the 
SIP service. The IpTelAppl requests the DMIFFilter to attach to the requested 
remote participant. The symbolic address of the remote user is passed as a SIP 
URL with the appropriate parameters. The DMIFFilter creates a new CallMaker 
object to handle the signaling task for this new call. Using the 
DNI_SessionAttach primitive, the CallMaker interacts with the SIPComm object 
(Client instance) to establish the connection (if TCP is used) with a SIP proxy 
or redirect server. The Client first has to locate the appropriate SIP proxy or 
redirect server in order to request the remote user. After the (successful) 
connection with the server, a response is given back to the IpTelAppl, with a 
handle that refers to the same objects for later requests. 

Fig. 6. Originating Calls - SIP scenario 

In a next step the IpTelAppl may add a voice channel. It calls the 
DAI_AddChannel primitive of the DMIFFilter to request a channel with 
specific QoS, if this is supported by the underlying network. The previously 
created CallMaker object is identified and asked to handle the new request. 
The CallMaker maps the Application QoS to the Network QoS and calls the 
DNI_AddChannel primitive to ask the SIPComm object to request a new 
channel with the remote user. SIPComm interacts with the SIP proxy or 
redirect server to locate and invite the remote user, using the INVITE message. 
In the basic scenario, it will receive a positive SIP response (200 OK) and will 
complete the invitation with the SIP ACK message. 

5 Conclusion and Future Work 

In this paper we have proposed a framework based on DMIF, together with 
possible mappings for common IP Telephony signaling operations. We do not 
intend to develop "yet another implementation" of one of the emerging signaling 
standards, but try to find a generic approach by identifying basic functionalities. 

Based on our experience with implementing a protocol gateway between 
H.323 and SIP the "conventional way" [1], we assume that the approach fits 
well with the current functionality of established signaling protocols while 
being flexible enough to incorporate changes or even cope with new protocols. 
Though we have concentrated on describing scenarios involving end systems, it 
is also applicable to the development of infrastructure components using a 
variety of signaling protocols. 

The basic motivation for choosing DMIF is its standardization and the 
experience that generating software in a standardized instead of a 
per-application or per-protocol way can speed up development and permits more 
generalized solutions. We consider our framework feasible and intend to 
implement it on top of different underlying protocol stack software, thus 
enabling applications to use the described interface and providing a means to 
evaluate its performance, implementation, and runtime overhead. 


Abstract. In the last decades, progress in microelectronics and VLSI 
technology has fostered the widespread use of computing and communication 
applications in portable electronic devices. Up until now, information transfer 
between these devices has been cumbersome, relying on cables and infrared. 
Recently, a new universal radio interface has been developed enabling 
electronic devices to connect and communicate via short-range radio 
connections. The Bluetooth™ radio technology eliminates the need for wires, 
cables and connectors between cordless or mobile phones, modems, headsets, 
PDAs, computers, printers, projectors, and so on, and paves the way for new 
and completely different devices and applications. Bluetooth is regarded as a 
complement and an extension to existing wireless technologies, addressing the 
short-range and inter-device connectivity. The technology enables the design of 
low-power, small-sized, and low-cost radios that can be embedded in existing 
portable devices. Eventually, embedded radios will lead towards ubiquitous 
connectivity. 

The Bluetooth radio operates in the unlicensed ISM band at 2.45 GHz. This 
band has to be shared with other applications that spread radio energy which 
can disturb the Bluetooth link. Several design features have been implemented 
to make the Bluetooth air interface as robust as possible against potential 
interferers. The Bluetooth radio applies frequency hopping with a nominal 
hopping rate of 1600 hops/s. At any instant, a narrowband channel of 1 MHz 
is used to transfer the data packets; every 625 µs, this channel is moved to a 
different carrier selected according to a pseudo-random hop sequence covering 
79 carriers that span the 80 MHz ISM band. The interface supports both 
synchronous links suited for circuit-switched applications like voice 
communications and asynchronous links suited for packet-switched 
applications like data. The voice and data protocols have been optimized to 
deal with interference present in the unlicensed band. Attention has been paid 
to low-power modes to maximize battery-life in portable devices. 

This paper gives an overview of the Bluetooth radio interface. It addresses key 
design issues that make Bluetooth a unique air interface that differs from 
existing wireless technologies. Examples are the support for ad-hoc 
connectivity, the use of an unlicensed band, the low power 
consumption, and the enabling of single-chip radios. The paper also presents 
some user scenarios and future developments. 


Abstract. This paper presents an architecture for QoS-aware middleware 
platforms. We present a general framework for control, and specialise this 
framework for QoS provisioning in the middleware context. We identify 
different alternatives for control, and we elaborate the technical issues related to 
controlling the internal characteristics of object middleware. We illustrate our 
QoS control approach by means of a scenario based on CORBA. 



1. Introduction 

The original motivation for introducing middleware platforms has been to facilitate 
the development of distributed applications, by providing a collection of general- 
purpose facilities to the application designers. Currently, commercially available 
middleware platforms, such as those based on CORBA, are still limited to the support 
of best-effort Quality of Service (QoS) to applications. This is an obstacle to 
the use of middleware systems in QoS-critical applications, or where services are 
offered under Service Level Agreements with strict QoS constraints. This 
limitation in the available middleware technology has inspired much of the research 
that is currently being done on QoS-aware middleware platforms. 

Ideally, a middleware platform should be capable of supporting a multitude of 
different types of applications with (a) different QoS requirements, (b) making use of 
different types of communication and computing resources, and (c) adapting to 
changes, e.g., in the application environment and in the available resources. The 
architectural framework presented in this paper has been developed to be flexible and 
re-usable. The main benefit of our framework is that it allows us to combine and 
balance solutions for the control of multiple QoS characteristics. 

From the research perspective, a framework-based approach supports the 
incremental introduction of new solutions such as control algorithms, and allows us to 
compare different solutions in the same setting. From a middleware developer’s 
perspective, this approach is attractive because it supports incremental development 
and the construction of product families in which different family members address 
different sets of QoS characteristics. 



* This work has been carried out within the AMIDST project (http://amidst.ctit.utwente.nl/). 

This paper identifies the problems that have to be solved in order to elaborate our 
architectural framework, and discusses some techniques that can be used to solve 
these problems. The use of the architectural framework is illustrated by means of a 
scenario. 

This document is further structured as follows: Section 2 introduces some basic 
concepts and discusses the role of a QoS-aware middleware when supporting 
object-based applications; Section 3 introduces our approach and its background, 
which stems from control theory; Section 4 discusses the technical issues that 
have to be addressed to realise the proposed framework, presents some 
requirements for each of these issues and indicates our solutions to fulfil these 
requirements; Section 5 illustrates the use of our architectural framework with 
a simple CORBA application using a naming service, and shows how QoS 
requirements can be enforced by the middleware platform; and Section 6 draws 
our conclusions. 



2. Concepts of QoS-Aware Middleware 

This section discusses the concepts that underlie our QoS-aware middleware 
architecture and identifies the role of a QoS-aware middleware platform in the 
support of distributed applications. 



2.1 Distributed Applications 

Distributed applications supported by a QoS-aware middleware consist of a collection 
of interacting distributed objects. Since in this paper we concentrate on the support of 
operations (invocations of methods), we assume that an object may play the role of 
either a client or a server on an interface. We also assume the ODP-RM 
computational model, in which objects may have multiple interfaces [7]. 

During the development of a distributed application, the interfaces of the 
application objects have to be specified. In principle this specification should define 
the attributes and operations of these interfaces. In the case of CORBA, one only 
specifies the server interface using IDL, and makes use of this specification for 
creating stubs and skeletons, or to dynamically create requests for operations. 
However, in general one could specify server and client interfaces, and define rules 
that can determine whether a server interface is capable of servicing a client interface 
[14]. Extensions of IDL that allow the definition of both client (required) and 
server (offered) interfaces have already been proposed in the CORBA component 
model [13]. 

When considering QoS-aware middleware, we suppose that the interface 
specifications are extended with statements on QoS that can be associated with the 
whole interface or with individual operations and attributes. In the case of a client 
interface, these statements describe the required QoS, while for a server interface 
these statements describe the offered QoS. QML 1^1 and QuO j[li| l are languages that 
allow one to specify the QoS associated with interfaces. 



2.2 Object Life Cycle 

After the objects of a distributed application have been implemented, the application 
is deployed. We consider the general case in which persistent objects and late binding 
can be supported by the middleware platform. In this case, an object has the following 
life cycle: 

1. Object creation, in which interface references for the server interfaces of an object 
are created and can be referred to by other objects; 

2. Object activation, in which an object starts execution, which implies that all local 
resources necessary for the object to execute should be properly allocated; 

3. Object deactivation, in which local resources allocated to an object may be 
released, although the interface references may still be valid in case persistent 
objects are supported; 

4. Object destruction, in which the object is deactivated (if it is still active) and its 
interface references are destroyed. 

A QoS-aware middleware platform can use object activation to refine the offered 
QoS, by restricting the ranges originally described for the offered QoS at design and 
implementation time. The run-time status of the middleware platform and the 
communication and computing resources should make it possible to determine this 
offered QoS more precisely. 
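
The four life-cycle phases can be sketched as a small class. This is only an illustration: the class and attribute names are hypothetical, and resource allocation is reduced to a boolean flag.

```python
class ManagedObject:
    """Toy model of the object life cycle listed above."""

    def __init__(self):
        # 1. Creation: interface references exist, the object is not executing.
        self.references_valid = True
        self.active = False
        self.events = ["created"]

    def activate(self):
        # 2. Activation: local resources are allocated and execution starts.
        if not self.references_valid:
            raise RuntimeError("object has been destroyed")
        if not self.active:
            self.active = True
            self.events.append("activated")

    def deactivate(self):
        # 3. Deactivation: resources may be released, references stay valid.
        if self.active:
            self.active = False
            self.events.append("deactivated")

    def destroy(self):
        # 4. Destruction: deactivate if still active, then drop the references.
        if not self.references_valid:
            return
        self.deactivate()
        self.references_valid = False
        self.events.append("destroyed")


obj = ManagedObject()
obj.activate()
obj.destroy()
print(obj.events)
```

In such a model, `activate` is the natural place for a QoS-aware platform to narrow the offered QoS, since this is the first point at which the actual resource situation is known.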



2.3 Explicit Binding 

Object interfaces have to be bound to each other in order to allow these objects to 
interact through the middleware. In CORBA, this binding happens implicitly when 
the client object issues a request (implicit binding). 

For QoS-aware middleware platforms, however, implicit binding is not desirable, 
since the QoS requirements may demand that resource allocation procedures are 
performed before the request is executed. Unfortunately, we cannot predict the speed 
and reliability of these procedures. In the worst case, we may still have to activate the 
server object. This means that we cannot always guarantee the QoS requirements by 
using implicit binding. Therefore, in QoS-aware architectures explicit binding is 
necessary, which consists of taking explicit actions at the computational level in order 
to establish the binding before interacting jv]. Our case for explicit binding for QoS- 
aware operation support is somewhat similar to the reasoning in Q for stream 
interface bindings. 

The client object requests the establishment of the binding, giving the 
middleware a reference to a server interface. This request also contains the required 
QoS, which can be retrieved from a QoS specification repository. The middleware 
platform then searches for the server object. If the server object has not been 
activated, the middleware platform first activates it. The middleware platform 
then compares the offered QoS with the required QoS and uses its internal 
information to determine an agreed QoS. This process is called QoS negotiation. 
If the binding establishment has been successful, the client and server objects 
are informed that a binding has been built. From this moment on these objects 
can interact through the binding. Figure 1 shows the establishment of a binding 
using a QoS-aware middleware platform. 

Fig. 1. Binding establishment using a QoS-aware middleware platform 

The agreed QoS is determined by considering the required QoS on one hand, and the 
composite QoS capabilities of the server object (the offered QoS) and the middleware 
platform on the other hand. The agreed QoS serves as a contract between the 
application objects and the middleware platform, which should be respected during 
the operational phase when the objects interact through the binding. 
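
As a minimal sketch of how an agreed QoS might be derived, consider a single QoS characteristic, response time, expressed as a range. The function below intersects the required range with the composite capability of the server object and the platform. The range representation, the additive `platform_delay` model, and the function name are assumptions for illustration; an actual negotiation policy may differ considerably.

```python
from typing import Optional, Tuple

Range = Tuple[float, float]  # (lowest, highest) response time in ms


def negotiate(required: Range, offered: Range,
              platform_delay: float = 0.0) -> Optional[Range]:
    """Return the agreed response-time range, or None if negotiation fails."""
    # Composite capability: the server's offer plus the delay added by the platform.
    composite = (offered[0] + platform_delay, offered[1] + platform_delay)
    low = max(required[0], composite[0])
    high = min(required[1], composite[1])
    return (low, high) if low <= high else None


# Client accepts up to 100 ms; server offers 20..60 ms; platform adds 10 ms.
print(negotiate((0, 100), (20, 60), platform_delay=10))
# Client accepts at most 25 ms: no overlap, so no binding can be agreed.
print(negotiate((0, 25), (20, 60), platform_delay=10))
```

A failed negotiation (the `None` result) corresponds to a binding establishment that does not succeed.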

The binding establishment may also result in the creation of a binding object. This 
object binds the client object and the server object, and offers a control interface that 
allows, for example, the inspection and modification of the agreed QoS. In this paper 
we ignore the adjustment of the agreed QoS through such an interface, but this is an 
interesting topic for further work. 

Our QoS control approach considers that a binding has been successfully 
established and that the agreed QoS has to be maintained. The middleware is 
responsible for that, and is constantly adjusting its internal characteristics and the 
usage of computing and communication resources in order to achieve it. 



3. Control Framework for QoS Provisioning 

This section introduces our approach and subsequently discusses a specialisation of a 
generic control system model for the purpose of controlling QoS in a middleware 
context. 



3.1 Approach 

The design of our QoS-aware middleware architecture is constrained by two 
conflicting requirements: a) the architecture has to be flexible enough such that it 
enables us to experiment with different QoS strategies and cope with different kinds 
of application demands; and b) certain aspects of the architecture have to be fixed so 
that the robustness and portability of the architecture can be guaranteed. 

For this reason we start off with a generic control system model, which we 
specialise, such that it applies to QoS-control in a middleware context. This 
specialised model forms our architectural framework, i.e. the fixed part of our 
architecture. Although some decisions are made with respect to the scope of control, 
the architectural framework is independent of any specific QoS-control strategy or 
algorithm. Therefore, different solutions can be compared and evaluated with this 
framework. 

A synthesis-based approach [ p~6] can be used to arrive at a complete QoS-control 
architecture, e.g. for a specific application or system environment. In this approach, 
requirements are converted into technical problems. For each technical problem, 
possible solution techniques are sought. The candidate solution techniques are then 
compared with each other from the perspective of relevance, robustness, adaptability 
and performance. Whenever a suitable solution technique is found, the fundamental 
abstractions of this technique are used to derive the architectural abstractions. This 
process is repeated until all the problems are considered and solved. Finally, the 
architectural abstractions are specified and integrated within the overall framework. 
Since solution domain knowledge changes smoothly, this approach provides us with 
stable and robust abstractions with rich semantics. The discussion of technical issues 
in Section 4 partially illustrates this approach. 



3.2 Generic Control System 

The main objective of QoS-aware middleware is to establish and enforce an agreed 
QoS that satisfies the demands of applications, given the available resources. We 
observe this is essentially a controlling problem, and therefore the QoS-control 
framework should be synthesised from the fundamental abstractions of control 
systems. 

A control system consists of a controlled system in combination with a 
controller. The interactions between the controlled system and the controller consist 
of observation and manipulation performed by the controller on the controlled system. 
The building blocks of the control process are shown in Figure 2. 




Fig. 2. Building blocks of a control process. 



The generic control model abstracts from the type of observation and the type of 
manipulation that can be employed by the controller on the controlled system. The 
relationship between the controlled system and the controller can be realised using 
different strategies. With a feed-forward control strategy, manipulation through 
control actions is determined based on manipulation of the input to the controlled 
system. A feed-back control strategy can be applied for behaviour optimisation. 
According to this strategy, measurements of the output delivered by the controlled 
system are compared with a desired behaviour (a reference) and the difference 
between them is used by the controller to decide on the control actions to be taken. 
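
The feed-back strategy can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions: the controlled system is modelled as a single number that the control action shifts directly, and the proportional rule with gain 0.5 is chosen only for illustration.

```python
def feedback_action(reference, measurement, gain=0.5):
    """Control action proportional to the difference between reference and output."""
    difference = reference - measurement
    return gain * difference


def run_loop(reference, initial_output, steps, gain=0.5):
    """Repeatedly measure the output, compare it with the reference, and act."""
    output = initial_output
    history = [output]
    for _ in range(steps):
        action = feedback_action(reference, output, gain)
        output += action  # toy controlled system: the action shifts the output
        history.append(output)
    return history


# A response time of 200 ms is driven towards a reference of 100 ms.
print(run_loop(reference=100.0, initial_output=200.0, steps=3))
```

Each iteration halves the remaining difference, so the output approaches the reference without the controller needing a model of why the deviation occurred.
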
3.3 QoS-Control System 

In QoS-aware middleware, the 'controlled system' is the middleware functionality 
responsible for the support of interactions between application objects, while the 
'controller' provides QoS control. Here, the environment represents the operational 
context of the middleware, which consists of application objects with QoS 
requirements and QoS offers. The middleware platform encapsulates the computing 
and communication resources at each individual processing node, which may be 
manipulated in order to maintain the agreed QoS. 

Figure 3 shows the specialisation of the generic control model for controlling the 
QoS provided by a middleware. 




Fig. 3. QoS-control architecture. 



In Figure 3 we identify two symmetrical structures, one for handling QoS 
measurement concerns and another for handling QoS manipulation concerns. A probe 
is a point of observation or manipulation that is available or must be planted in the 
controlled system, i.e., the middleware platform. Many probes may be planted in the 
controlled system, for both observations and manipulations. 

A sensor is a mechanism that uses a probe to obtain observations. Observations can 
only be useful if they are interpreted in terms of measurements that can be compared 
with the reference, i.e., they are represented using the same units and have the same 
semantics. For example, observations can be time moments of the sending of a 
request and the receiving of the corresponding response. The needed measurement 
could be the average response time, which implies that the average of the difference 
between the time moments observed should be calculated in order to generate the 
measurement. This calculation is performed by an interpreter. In general, the 
interpreter combines observations, which could even come from different sensors, in 
order to generate measurements. 

A comparator compares the measurement with an associated reference value (an 
agreed QoS value) and determines the difference. A decider takes the difference and 
applies some algorithm to establish a control strategy, consisting of the objectives to 
be reached in this execution of the control loop. The control strategy must be 
translated into a collection of control actions, i.e., manipulations of the controlled 
system; this is the responsibility of a translator. The translator distributes the control 
actions among the actuators, thereby realising the control strategy. An actuator 
schedules the control actions such that they are carried out using one or more probes. 
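One execution of the control loop can be sketched as follows. This is an illustration under stated assumptions, not the prototype's code: the names `control_step`, `decide` and `translate`, the 25 ms threshold and the action names are all hypothetical, and the difference is taken as reference minus measurement for a delay-like characteristic.

```python
def control_step(measurement, reference, decide, translate, actuators):
    """Run one execution of the control loop and return the control actions."""
    difference = reference - measurement       # comparator
    strategy = decide(difference)              # decider
    actions = translate(strategy)              # translator
    for actuator_name, action in actions:      # distribution among actuators
        actuators[actuator_name](action)       # actuator applies the action
    return actions


applied = []  # records what the (stub) actuators were asked to do


def decide(difference):
    # Hypothetical rule: act when less than 25 ms of slack remains.
    return "reduce_delay" if difference < 25 else "no_change"


def translate(strategy):
    # Hypothetical mapping from a strategy to (actuator, action) pairs.
    if strategy == "reduce_delay":
        return [("transport", "raise_priority"),
                ("threads", "use_high_priority_thread")]
    return []


actuators = {
    "transport": lambda action: applied.append(("transport", action)),
    "threads": lambda action: applied.append(("threads", action)),
}

control_step(measurement=80, reference=100, decide=decide,
             translate=translate, actuators=actuators)
```

Passing the decider and translator as functions keeps the loop itself independent of any particular controlling technique.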



4. Technical Issues 

This section identifies and elaborates a number of the technical issues that have to be 
addressed in order to realise the QoS control architecture. The first five issues 
correspond to the realisation of the components in our architecture. We explain the 
requirements, and propose some possible solutions and solution strategies. 



4.1 Collecting Observation Values 

In order to collect observation values we have to develop probes and sensors. Probes 
connect the middleware to the control mechanisms and are independent of the actual 
measurements. Sensors collect the actual measurements and they typically depend on 
the amount and types of data that are collected. 

The fundamental requirement on the probes and sensors is that they must have a 
minimal impact on the middleware platform. This introduces two issues: (a) how to 
minimise the impact on the middleware code, and (b) how to map the probes to one or 
more specific places in the middleware code (cross-cutting problem I£J). 

Reflection is a technique in which a system is explicitly represented in terms of a 
meta-object, allowing one to manipulate the (structure of the) system by manipulating 
its meta-object. A reflection-based approach is well suited to the collection of 
observation values. 

Crosscutting of concerns requires either careful documentation and management of 
probe insertion points, or entirely new tools and techniques for specifying and 
implementing crosscut concerns. Recent work in the area of Aspect-Oriented 
Programming (see, e.g., [|oJ ) addresses these issues. 



4.2 Interpretation of Observations 

The interpretation process depends on many factors: the involved observation data, 
the required measurements, and the rules or strategies for interpretation. The number 
of interpretation rules and their complexity also determine the interpretation process. 

The interpretation part should not become a possibly large collection of 
unstructured ad-hoc code. This implies that a generic model should be developed to 
define how observations are translated to measurements, such that interpretation code 
can be reused or generated automatically as much as possible. If measurements are 
derived from statistical information, a lot of input data may be required, so the 
amount of storage and processing should be reduced as much as possible. 

The interpretation process is essentially a transformation from a set of input values 
to a set of output values. The input values vary in source, type and time, depending 
on where they originate, i.e., on how the middleware has been 
designed. The resulting output should be independent of the specific implementation 
details of certain middleware and applications, and it should be suitable for the 
comparison process. This means that a common QoS meta model should be available 
that determines the types and values of both measurements and the references. 

The interpretation of observations can be done through calculations, heuristics 
(logic rules), stream interpreters or conversions. We need to model these different 
techniques in a uniform way, with explicit dependency relations to a structured 
representation of the observations and measurements. Interpretation rules should all 
be a specialisation of a single abstraction, i.e., the interpreter. Each individual 
instantiation can be considered as a micro-interpreter. For each QoS measure, there 
should be a clear specification of the interpretation rules in terms of formulas or 
guidelines. In Figure 4 at the end of this section the extensions to our architecture to 
meet the needs of interpretation are shown. 
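The idea of micro-interpreters as specialisations of a single abstraction can be sketched as follows. The class names and the two interpretation rules (maximum delay, request rate) are assumptions chosen for illustration; the text does not prescribe them.

```python
from abc import ABC, abstractmethod


class Interpreter(ABC):
    """Single abstraction from which every interpretation rule is specialised."""

    @abstractmethod
    def interpret(self, observations):
        """Translate a list of (send_time, receive_time) pairs into a measurement."""


class MaxDelayInterpreter(Interpreter):
    """Micro-interpreter: largest observed delay, in the unit of the time moments."""

    def interpret(self, observations):
        return max(received - sent for sent, received in observations)


class RequestRateInterpreter(Interpreter):
    """Micro-interpreter: number of requests per second over a fixed period."""

    def __init__(self, period_seconds):
        self.period_seconds = period_seconds

    def interpret(self, observations):
        return len(observations) / self.period_seconds


observations = [(0, 70), (100, 190), (200, 280)]  # hypothetical sensor output
micro_interpreters = {
    "delay": MaxDelayInterpreter(),
    "rate": RequestRateInterpreter(period_seconds=0.5),
}
measurements = {name: interpreter.interpret(observations)
                for name, interpreter in micro_interpreters.items()}
```

Each QoS measure gets its own micro-interpreter, while the rest of the controller depends only on the common `interpret` operation.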



4.3 Determination and Representation of the Difference 

The comparator compares the measurement with the reference model and determines 
the difference. This comparison can vary from subtraction in the simple case of one 
QoS characteristic with a numeric value, to complex calculations possibly using 
heuristics in the case of multi-faceted QoS characteristics. The main task of the 
comparator is to deliver an abstraction of the ‘problem to be solved’ that is as far from 
the implementation details of the environment as feasible. 

The difference produced by the comparator serves to detect (potential) violations of 
the QoS. Such violations depend on the agreed QoS. Hence, the difference must be 
obtained by comparing the actual measurements with corresponding references 
specified by the agreed QoS. 

The difference could be represented as a ‘distance’ vector, where each element of 
the vector corresponds to a relevant QoS characteristic. 
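A distance vector of this kind can be sketched as a mapping from each QoS characteristic to the difference between reference and measurement. The function name and the sample values are illustrative assumptions; how the sign is read depends on the characteristic (a positive delay distance is slack, whereas a negative rate distance means the measured rate exceeds the reference).

```python
def distance_vector(reference, measurement):
    """Return reference minus measurement for every QoS characteristic."""
    return {name: reference[name] - measurement[name] for name in reference}


reference = {"delay": 100, "rate": 5}     # hypothetical agreed QoS values
measurement = {"delay": 80, "rate": 6}    # hypothetical measurements
difference = distance_vector(reference, measurement)
```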

Measurements and references should be described in such a way that they can be 
compared (see Section 4.2). For this purpose we use a QoS meta-model, which consists 
of a collection of concepts that allow one to specify the measurement, the 
reference, and the difference. Another benefit of having a QoS meta-model is the 
ability to build QoS specification repositories. We adopt and adapt QML [4] to 
specify the QoS meta-model and its instantiations. 

Figure 4 illustrates the use of the QoS meta-model in our overall architecture. The 
agreed QoS is determined before entering the operational phase, through negotiation 
based on QoS requirements, QoS offers and the capabilities of the middleware 
platform. In this paper we assume that the agreed QoS is not modified during the 
operational phase. 

4.4 Controlling Algorithm 

The difference or distance vector computed by the comparator may or may not 
define a situation that requires controlling (i.e., correcting) actions to be taken. The 
controlling algorithm is responsible for selecting an appropriate strategy. The strategy 
to be chosen depends partially on the specific state and configuration of the 
middleware. Rather than mixing middleware state and configuration information with 
the measurements and difference, this information must be available independently. 
For this reason, we introduce a middleware control model. This model is an 
abstraction (model at a meta-level) of the middleware, which specifies what can be 
parameterised or tuned in the middleware, or which components can be plugged in, 
deactivated and activated. 

The task of the decider is two-fold: firstly to ensure that the agreed QoS can indeed 
be supported by the middleware platform, and secondly to optimise the overall QoS 
characteristics, by balancing the different, often contradictory, requirements. In its 
most general form, controlling is an artificial intelligence task that involves domain 
knowledge and heuristics about managing and controlling QoS, and the 
interdependencies between QoS characteristics. 

We have not selected a particular solution for the controlling algorithm: our goal is 
to offer a framework that allows experimentation with (combinations of) different 
techniques such as mathematical algorithms, heuristic rules and the use of fuzzy logic 
as a means of expressing and reasoning about weak but conflicting optimisation [ pT| . 
Figure 4 shows the extension of our architecture with an explicit middleware control 
model. 



4.5 Control Strategy and Middleware Manipulation 

A control strategy is the output of the controlling algorithm, and it should be an 
implementation-independent representation of the solution strategy for pursuing 
certain QoS characteristics. Control strategies are strongly related to the controlling 
algorithm. 

Control actions are abstractions that represent concrete functional behaviour, but 
are independent of the implementation details of the specific middleware software. 
Control strategies represent sets of control actions that are to be applied to the 
middleware in a co-ordinated way. The representation of control strategies must 
consist of at least the following parts: a) a set of control actions; b) a set of probes in the 
middleware where the control actions can be applied; and c) a co-ordination 
specification, which could be a script or any other form of executable specification. 
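The three parts of a control strategy can be sketched as a small data structure. The class and field names, the probe names and the `sequential` co-ordination function are hypothetical; the co-ordination specification is modelled here as a plain function, which is only one of the executable forms the text allows.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass(frozen=True)
class ControlAction:
    name: str    # implementation-independent description of the behaviour
    probe: str   # probe at which the action is to be applied


@dataclass
class ControlStrategy:
    actions: List[ControlAction]                                          # part a)
    probes: Set[str]                                                      # part b)
    coordination: Callable[[List[ControlAction]], List[ControlAction]]   # part c)

    def plan(self) -> List[ControlAction]:
        """Check that every action has a probe, then apply the co-ordination."""
        for action in self.actions:
            if action.probe not in self.probes:
                raise ValueError(f"no probe {action.probe!r} for {action.name!r}")
        return self.coordination(self.actions)


def sequential(actions):
    """Simplest co-ordination: carry out the actions in the order given."""
    return list(actions)


strategy = ControlStrategy(
    actions=[
        ControlAction("plug_in_protocol", "client_transport"),
        ControlAction("plug_in_protocol", "server_transport"),
        ControlAction("set_priority", "client_transport"),
    ],
    probes={"client_transport", "server_transport"},
    coordination=sequential,
)
planned = [(action.name, action.probe) for action in strategy.plan()]
```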

There are a few ways to affect the behaviour of a running system like a middleware 
platform: a) by invoking operations of a local API; b) by modifying the internal state 
of the system, c) by replacing components of the system with different 
implementations, and d) by meta-level manipulation of the system itself. A control 
action can only be a specialisation or instantiation of one of these. 

The implementation of control actions through actuators and probes introduces 
technical issues comparable to the ones discussed in Section 4.1 and therefore they are 
not discussed further. 

Fig. 4. A detailed version of the architecture incorporating some of the enhancements that are 
discussed in this section. 



4.6 Feasibility of the Overall Control Loop 

The performance overhead introduced by our architectural framework has to be 
carefully considered when using the framework in practical settings. The technical 
solutions should not make the overall QoS worse than what it would be without them. 
Several QoS requirements are related to performance (e.g., delays and throughput). 
Implementations of our architecture may require a lot of additional activities and 
overhead, which may conflict with the QoS requirements they try to enforce. By 
adopting a tailorable framework approach, we may choose to build instances of the 
framework with components ranging from simple, low-overhead components up to 
complex components. This approach can help to cope with the performance overhead 
by using more efficient versions wherever necessary. In the future, the use of a meta- 
controller to switch dynamically between different versions may be considered. 

Feedback control loops may make the controlled system oscillate between two 
undesirable states, depending on the corrective measures and their effects. In some 
cases, mathematical models based on control theory can help to predict whether the 
system is stable during operation, allowing one to avoid oscillation. If 
mathematical models are not available or are not precise enough, some heuristics may 
show whether the system is stable or not. Alternatively, additional (meta-level) 
controllers could be introduced to detect instability and take measures to avoid it, e.g., 
by acting on the controlling algorithm. The use of fuzzy logic in the controlling 
algorithms may also help to prevent the control loop from oscillating during operation. 
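As an illustration of the kind of heuristic a meta-level controller might use to detect instability, the sketch below flags a loop whose recent differences keep changing sign. This particular heuristic, the function name and the threshold are assumptions made for this example and are not taken from the architecture itself.

```python
def is_oscillating(differences, min_sign_changes=3):
    """Heuristic: report oscillation if the differences change sign often.

    `differences` is the recent history of comparator outputs for one QoS
    characteristic; zero values are ignored.
    """
    signs = [d > 0 for d in differences if d != 0]
    changes = sum(1 for earlier, later in zip(signs, signs[1:])
                  if earlier != later)
    return changes >= min_sign_changes


unstable = is_oscillating([5, -4, 6, -3, 4])    # sign flips on every step
settling = is_oscillating([20, 15, 10, 5, 2])   # monotone, no sign change
```

A meta-level controller could use such a signal to dampen the corrective measures, for example by lengthening the control period.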

5. Scenario 

This section demonstrates our architectural framework by means of a scenario for a 
simple CORBA application using a naming service. 



5.1 Scenario Set-Up and Use of QML 

Figure 5 depicts a QoS provisioning scenario for a CORBA naming service 
application. The application consists of a client object that intends to invoke a method 
on a NamingContext object through a QoS-aware ORB. 






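[Figure 5: a client object with required QoS (Performance: Delay < 100, Rate > 5) 
invokes a NamingContext object with offered QoS (Performance: Delay < 60, Rate < 7) 
through a QoS-aware ORB. The NamingContext interface shown in the figure is:] 

interface NamingContext { 
    void bind(in Name n, in Object obj); 
    void bind_context(in Name n, ...); 
    Object resolve(in Name n); 
}; 
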




Fig. 5. A naming service application scenario. 



The offered QoS of the server object and the required QoS of the client object are 
depicted in a simplified form: 

• the client object requires a delay (time necessary to complete a request) smaller 
than 100, and a supported rate (number of requests per time unit) of at least 5; 

• the NamingContext object offers a delay smaller than 60, and a supported rate 
up to 7. 

We use QML [4] to express the required or offered QoS of an object. The QoS is 
specified using the QoS dimensions of a QML contract type. Figure 6 shows a 
possible QML contract type that defines the relevant QoS characteristics in this 
scenario, viz. delay and rate. 



type PerformanceType = contract { 
    delay: decreasing numeric msec; 
    rate: increasing numeric req/sec; 
}; 

Fig. 6. A QML contract type 

PerformanceType contract { 
    delay < 60;    // maximum delay 
    rate > 7;      // minimum rate 
}; 

Fig. 7. A QML contract 

QoS contract types should be defined by the QoS meta-model discussed in Section 4. 
The QoS that an application object requires or offers is specified by a QML contract. 
A QML contract puts constraints on the dimensions of the corresponding QML 
contract type. Figure 7 shows a possible instance of the QML contract type of Figure 
6. This contract corresponds to the offered QoS of the naming context object in Figure 
5; the required QoS of the client object can be defined in a similar way. 



5.2 Binding Establishment 




Fig. 8. An established binding with an agreed QoS. 



During the binding establishment phase, the client object requests the establishment 
of a binding with the NamingContext object. Whether such a binding is successful 
depends on whether the application object and the middleware platform together can 
satisfy the required QoS of the client object. If so, an agreed QoS is established and 
the middleware should take appropriate actions, such as: 

• update of the QoS reference base (see Figures 3 and 4); 

• instantiation of sensors that can be used for measurements during the operational 
phase; 

• instantiation of actuators and/or other configuration settings to prepare support for 
the agreed QoS. For example, when a configurable transport protocol is used, a 
connection with certain characteristics may be set up. 

Figure 8 depicts a possible result of a successful binding establishment phase, where a 
QoS is agreed and a binding object is created that is aware of the agreed QoS. The 
offered QoS of the middleware in this case could be an (added) delay less than 20, 
and a supported rate less than 100. For the QoS characteristics in this scenario, the 
constraints on the negotiation process that led to the agreed QoS are as follows: 

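• delay: $\mathit{QoS}_{\mathrm{offered}}(\mathrm{server}) + \mathit{QoS}_{\mathrm{offered}}(\mathrm{middleware}) < \mathit{QoS}_{\mathrm{agreed}} < \mathit{QoS}_{\mathrm{required}}$ 

• rate: $\mathit{QoS}_{\mathrm{required}} < \mathit{QoS}_{\mathrm{agreed}} < \min(\mathit{QoS}_{\mathrm{offered}}(\mathrm{server}), \mathit{QoS}_{\mathrm{offered}}(\mathrm{middleware}))$ 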
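Applied to the numbers of this scenario, the two constraints can be checked with a short sketch. The function name and the dictionary layout are assumptions for illustration; the returned pairs are the open intervals from which the agreed values may be chosen.

```python
def agreed_qos_ranges(required, server_offer, middleware_offer):
    """Return the open intervals admissible for the agreed delay and rate.

    Returns None if the constraints cannot be satisfied, i.e. if either
    interval is empty.
    """
    delay_low = server_offer["delay"] + middleware_offer["delay"]
    delay_high = required["delay"]
    rate_low = required["rate"]
    rate_high = min(server_offer["rate"], middleware_offer["rate"])
    if delay_low >= delay_high or rate_low >= rate_high:
        return None
    return {"delay": (delay_low, delay_high), "rate": (rate_low, rate_high)}


ranges = agreed_qos_ranges(
    required={"delay": 100, "rate": 5},           # client object
    server_offer={"delay": 60, "rate": 7},        # NamingContext object
    middleware_offer={"delay": 20, "rate": 100},  # middleware platform
)
```

For this scenario the agreed delay must lie between 80 and 100, and the agreed rate between 5 and 7.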

5.3 Operational Phase 

During the operational phase, the client object invokes requests and obtains replies 
from the NamingContext object. Invocations may trigger measurements or the 
installation of timers to measure the actual offered QoS. If the measured QoS 
approaches certain thresholds related to the agreed QoS stored in the QoS reference 
base, the control system may take precautions, either proactively or reactively. 
Proactive control actions are taken when a “danger zone” has been entered, but before 
a violation has been detected on the agreed QoS; reactive control actions are applied 
if a violation of the agreed QoS has been detected. Control actions include scheduling 
of a request on a high-priority thread, transmitting the request over a transport 
network with priority routing, installing a protocol plug-in that takes advantage of the 
network QoS by differentiating between the priority of network packets, or optimising 
delivery to multiple recipients through a multi-cast protocol. 

For example, the handling of QoS during the operational phase, where (for reasons 
of space) we focus on delay, involves the following steps: 

• To determine the actual delay, we can measure the time that elapses between the 
sending of a request and the receipt of a corresponding reply. Sensors are 
responsible for collecting this information from relevant probes (e.g., timer, 
interceptor) planted in the middleware platform. 

• The interpreter translates the observations received from the sensors into 
measurements that can be usefully compared to the agreed delay (reference) value 
that was established for this binding. For example, the delay observations in a 
certain time period may be used to compute an (average) delay measurement. 

• The comparator compares the measurement values with the corresponding 
reference value. The result could be an element in a distance vector 
$\langle \Delta\mathit{Delay}, \Delta\mathit{Rate} \rangle$, where $\Delta\mathit{Delay} = \mathit{Delay}_{\mathrm{reference}} - \mathit{Delay}_{\mathrm{measured}}$. 
In a concrete case, we may have a delay ‘distance’ of 20; for an agreed maximum 
delay of 100, this means that the actual offered delay has reached 80% of the 
maximum allowed delay. 

• The controlling algorithm must decide, based upon the difference values, and other 
state information, how to deal with the situation. In this case, we assume that over 
80% of the maximum delay is the ‘danger zone’, which requires specific actions to 
speed up the transport of the request/reply messages. A suitable control strategy 
that may improve the delays, is the activation of a faster/prioritised transport 
protocol (e.g., RSVP [6]). 

• The translator uses state and configuration information from the middleware 
control model to determine the availability of the appropriate transport protocol, 
and the location where it should be plugged-in. The resulting control actions 
consist of plugging in the protocol on the client side, plugging in the protocol on 
the server side, and setting the priorities upon this transport plug-in. 

• The control actions are performed by the actuator, which needs to access the 
middleware software for plugging in the protocols, and must perform the right 
priority/delay settings for each of the protocol instantiations. The probes used by 
the actuator can be APIs and/or global variables. 

The steps that have been described here are performed repeatedly, either triggered by 
an internal clock, or by events (timers that expire, requests that are sent or received, 
etc). 
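The comparison and the ‘danger zone’ decision for delay can be sketched as follows. The function name, the state labels and the choice to treat exactly 80% as already inside the danger zone are assumptions made for this example; integer percentages are used to avoid rounding effects.

```python
def classify_delay(measured, reference, danger_percent=80):
    """Return (distance, state) for a measured delay and its agreed reference.

    state is "violation" if the measured delay exceeds the reference,
    "danger zone" if it has reached danger_percent of the reference,
    and "ok" otherwise.
    """
    distance = reference - measured
    if distance < 0:
        return distance, "violation"        # reactive control is needed
    if measured * 100 >= danger_percent * reference:
        return distance, "danger zone"      # proactive control is possible
    return distance, "ok"


proactive = classify_delay(measured=80, reference=100)
normal = classify_delay(measured=50, reference=100)
reactive = classify_delay(measured=110, reference=100)
```

With a measured delay of 80 and a reference of 100 the distance is 20, which corresponds to the concrete case described in the steps above.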

6. Conclusion 

In this paper we have presented an architecture to support QoS-aware middleware. 
We have introduced some assumptions and general concepts for using QoS-aware 
middleware. The key part, and focus of this paper, is the QoS-control system. The 
QoS controller in this system observes and, if necessary, manipulates the state of the 
controlled system, i.e. the middleware platform that supports distributed applications. 
The design of the QoS controller is an architectural framework that is based on 
models from control theory. This should ensure its stability with respect to evolving 
requirements, and its applicability to a wide range of controlling techniques. 

The QoS-control architecture was discussed in more detail by examining a number 
of technical issues that must be addressed when realizing the proposed architecture. 
For each of these issues, we discussed requirements and corresponding solutions or 
solution approaches. We illustrated our proposal by describing a simple example of an 
application with QoS requirements, and how this would be dealt with in the proposed 
architecture. 

An initial proof of concept of our approach has been performed in 0 The 
prototype, based on the ORBacus implementation of CORBA, measures the QoS by 
using Portable Interceptors during system operation, and controls QoS at the transport 
level. Control actions are performed through a pluggable transport protocol that 
prioritizes IP packets using DiffServ [10] features. 

In the paper, we hinted at several topics for interesting future work. These topics 
address the further development and prototyping of our control architecture, as well as 
exploring controlling strategies and algorithms that could not be considered so far. In 
addition, we would like to benefit from the results of related work: 

• One of the characteristics of our proposal is that the architecture is largely 
independent of the specific implementation architectures of middleware systems. 
The QoS controller is separate from the middleware (and applications) and may 
interact with these through a number of probes (a generic term for interfaces that 
abstracts from specific implementations). Conceptually (and possibly 
implementation-wise), this is a reflective model; our QoS controller observes and 
manipulates the middleware at a meta-level. Several other proposals for reflective 
middleware have been made, e.g. |^]. 

• A middleware framework for QoS adaptation has been described in [11]. Both a 
task control model and a fuzzy control model have been used in this framework to 
formalise and calculate the control actions necessary to keep the application QoS 
between bounds. This framework shares many design concerns with our 
framework, although it has been targeted to the control of applications. 

• OMG is currently developing Real-time CORBA facilities within the scope of the CORBA 
3.0 standard [15]. These facilities allow one to manipulate some middleware 
characteristics that influence the QoS, such as the properties of protocols 
underlying the ORB and the threading and priority policies applied to the handling 
of requests by server objects. These facilities are defined in terms of interfaces that 
have to be implemented in the middleware platform, generalising in this way the 
control capabilities of the platform. 

Abstract. In this paper, we describe an architecture for a scalable distributed 
server management system which is capable of managing media assets stored on 
large-scale media server networks. In addition to the functionality for a 
decentralized management of the whole server network, the server management 
system allows bandwidth reservations to be made in advance, both for real-time 
streaming and for the transmission of media assets. This increases the QoS for 
end-users since bandwidth reservations for streaming applications can be planned 
and established a long time before the actual streaming process takes place. 
Moreover, this approach allows a tight estimation of the duration required for 
copying media assets among different servers of the network. In order to minimize 
the amount of network bandwidth required for copying media assets and to reduce 
the total number of copying processes, the management software can place media 
assets on the servers of the network in such a way that assets of particular 
interest to a certain client are located on the client’s local server. In this 
paper, the architecture of the server management system itself is presented 
together with the approach used to provide QoS guarantees for end-users. The 
overall system uses the differentiated services model. Besides the technical 
concept, we also present results of the integration of the system into the 
corporate network of a German company. 



1 Introduction 

In large-scale media streaming applications, a central server for the storage and 
delivery of broadband media streams keeps the administrative effort low but 
has the obvious disadvantage of limited scalability, concerning both network 
and storage resources. Especially when MPEG-2 encoded video streams are used, 
network and storage requirements pose severe constraints on the hardware 
infrastructure. Compared to a central server system, a network consisting of several 
media servers has the advantage that it can be scaled to a large extent and 
can support a large number of widely distributed clients, but it requires additional 
effort for content management. Thus, an important aim is to develop 
mechanisms for the automatic management of broadband media content in a 
network of distributed media servers which can be connected with each other 
using the available network infrastructure such as Internet or satellite links. 

* This work was supported by a grant from Siemens AG, Munich in the framework of 




The management software described in this paper organizes a media server 
network such that one or more servers are used as caches for a single island 
(i.e. an environment which provides the necessary network infrastructure for 
streaming high-quality media assets). The media assets are mapped onto the 
available servers in such a way that the caches store the media assets most likely 
to be accessed from within such an island. Thus, each client has easy and cheap 
access to any media asset stored on a media server that is located within the 
same island as the client. The mapping is based on knowledge about clients’ 
preferences, e.g. subscriptions to content categories or user profiles. Changes in 
user behavior and new media assets require a remapping of the media assets, i.e. 
file transfers between the different servers. Clients inside islands also have access 
to the assets stored on servers located in remote islands or somewhere in the 
backbone network (see Figure 1). However, streaming from a remote server might 
not be possible due to insufficient bandwidth or network connections 
that are available only for a certain time, such as satellite links. Therefore, in 
these cases assets must be transferred to the client’s local media server. 



134 L.-O. Burchard and R. Lüling 



For transferring media assets from one server to another, timing constraints 
must be considered, i.e. deadlines for the delivery of the media assets to the 
destination. For example, in a teleteaching environment the media assets required 
during lectures must be available when the lectures start. This means that, when 
a client requests a media asset from a remote server, the server management system 
has to provide an estimate of the time at which the asset will be available at the 
client’s local server. In order to deal with unreliable network connections and 
changes of the available bandwidth, the durations can be largely overestimated. 
However, this leads to unrealistic estimates and to inefficient network utilization. 
In contrast, the approach described in this document is to use network resource 
reservations also for file transfers, which allows transfers of media assets to be 
planned in advance and their durations to be tightly estimated. 
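With a bandwidth reservation in place, the duration of a file transfer follows directly from the asset size and the reserved rate. The sketch below illustrates this arithmetic; the function names, the asset size and the reserved rate are hypothetical, 1 MByte is taken as 10^6 bytes and 1 MBit as 10^6 bits, and protocol overhead is ignored.

```python
def transfer_seconds(size_mbyte, reserved_mbit_per_s):
    """Duration of a transfer that uses exactly the reserved bandwidth."""
    return size_mbyte * 8 / reserved_mbit_per_s


def latest_start(deadline_s, size_mbyte, reserved_mbit_per_s):
    """Latest start time (in seconds) that still meets the delivery deadline."""
    return deadline_s - transfer_seconds(size_mbyte, reserved_mbit_per_s)


# A hypothetical 900 MByte asset copied over a 2 MBit/s reservation.
duration = transfer_seconds(900, 2)
start_by = latest_start(deadline_s=10000, size_mbyte=900, reserved_mbit_per_s=2)
```

Because the rate is reserved rather than best-effort, the estimate does not need a large safety margin.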

In this document, we present the Distributed Server Management System 
(DSMS), a management software for large-scale media server networks. The 
system provides a mechanism for resource reservations on the network for both 
streaming and non-streaming traffic based on differentiated services. This makes it 
possible to plan the transfers of media assets and streaming traffic in advance and to 
provide reliable information about the duration of file transfers. In addition, 
the DSMS allows the decentralized administration of the media servers in 
the network, i.e. media assets can be installed on each server independently. The 
DSMS keeps track of the media assets stored on the servers and ‘publishes’ them 
to the whole network. 
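To illustrate what planning reservations in advance involves, the following sketch checks whether a new reservation fits on a single link. It is a generic admission test written for this example, not the DSMS mechanism; the function name, the tuple layout `(start, end, bandwidth)` and the sample values are assumptions.

```python
def admissible(reservations, capacity, start, end, bandwidth):
    """True if `bandwidth` can be reserved during [start, end) on one link.

    The load on the link only increases at the start times of existing
    reservations, so it suffices to check the request's own start and
    every reservation start that falls inside the requested interval.
    """
    points = {start} | {s for s, e, b in reservations if start < s < end}
    for t in points:
        load = sum(b for s, e, b in reservations if s <= t < e)
        if load + bandwidth > capacity:
            return False
    return True


# Two existing reservations of 4 MBit/s each on a 10 MBit/s link.
booked = [(0, 100, 4), (50, 150, 4)]
overlapping = admissible(booked, capacity=10, start=60, end=120, bandwidth=4)
later = admissible(booked, capacity=10, start=100, end=150, bandwidth=4)
```

The first request is rejected because both existing reservations are active at time 60, while the second fits because only one of them remains active from time 100.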

The rest of the paper is organized as follows: after discussing related work, 
we provide an overview of the current application environment of the DSMS 
in a German company. Following that, the system architecture and the advance 
reservation mechanisms are presented. Finally, we give some conclusions and 
briefly discuss future work. 



1.1 Related Work 

In order to limit the need for copying large media files between different 
servers, it is necessary to map the media assets onto the server network in 
such a way that the need for transferring media files across the network is minimized, 
which means that assets of particular interest for a client are located close to 
that client. The methods and algorithms for such an optimization are based on 
the results described in |7|1 Hj . In these papers, the mapping problem is addres- 
sed using simulated annealing heuristics. The work presented in those papers is 
not restricted to the business TV scenario described in this document, but also 
discusses techniques for TV Anytime server networks jEJ, i.e. systems that store 
broadcast media assets such as free TV programmes and allow clients to access 
those assets at any time. In contrast to the business TV environment, in the 
TV Anytime scenario (which is another field of application for the DSMS) the 
requirement for efficiently utilizing the available network and storage capacities 
of the server network is much more important, due to the enormous amount of 
media assets. 

The foundations for using advance reservations on networks, i.e. reservations 
of resources before they are actually required, have been discussed in several 
papers. In general, those publications concentrate on bandwidth reservations 
for streaming, i.e. video-conferencing and video-on-demand applications. To the 
authors’ knowledge, the requirement for establishing QoS even for the transfer 
of large media files has not yet been considered. 

In Q, a model for implementing advance reservations on networks is de- 
scribed. Several issues of the design of an advance reservation mechanisms are 
presented, e.g. different channels or partitions for immediate and advance reser- 
vations. The general requirements for the allocation not only of network resources 
but also computing resources in server and client systems are addressed in El, 
discussing the overall scenario in detail. 

Several publications discuss extensions of the RSVP JJJ protocol in order 
to enable advance reservation mechanisms. In m some basic requirements for 
implementing advance reservations are discussed, using RSVP as an example for 
resource reservation protocols. However, a general model is not presented and 
an actual implementation is not addressed. The focus of pj is not on a general 
model, but on the implementation of advance reservation mechanisms on top of 
existing protocols such as RSVP and ATM. In PEH, several issues concerning 
the provision of advance reservation mechanisms are discussed such as admission 
control or extensions to the existing RSVP protocol. These papers propose an 
agent-based architecture for advance reservations where each routing domain 
in a network contains an agent responsible for admission control on behalf of 
the routers in the network, similar to the approach chosen in this document. 
However, these papers only take streaming applications into account; the transfer 
of large data files is not considered. 

The use of differentiated services (DiffServ, [H) for advance reservations, as 
described in this document, has not been addressed so far. 

In the following sections, we present the application environment in the 
corporate network and its requirements, followed by the description of the DSMS 
and the approach to provide the required quality of service for streaming and 
transmitting media assets in the server network. 

2 Application Environment 

The DSMS is used to manage networks of media servers used for high-quality 
streaming applications, i.e. using MPEG-1 or MPEG-2 encoded streams, in the 
area of teleteaching, TV Anytime, or business TV applications as described in 
this document. In the following, the current application of the DSMS in the 
corporate network of Pixelpark AG, Germany, which was implemented in the 
framework of the project HiQoS (“High Performance Multimedia Services with 
Quality-of-Service Guarantees”), will be described together with the requirements 
for QoS guarantees on networks in this environment. 

Generally, the server network consists of different islands, each containing at
least one media server and a number of clients which stream media assets from
the local media server. In this environment, media servers capable of delivering
MPEG-1 and MPEG-2 encoded streams are used. Within each island (in this
particular application, islands correspond to branch offices of the company) the
available bandwidth is usually considered to be sufficient for a small number of
streaming processes. However, especially in the case of MPEG-2 encoded streams
with at least 4 Mbit/s, resource reservations can be required even within a
single island in order to guarantee the jitter-free display of the streams. These
reservations can be made using the hardware support for differentiated services
which is available in the corporate network¹.

The different islands which form the server network are connected with each
other using Internet links, which are clearly insufficient for streaming
high-quality media assets in real time. Besides using physical networks, as in
this scenario, the DSMS provides the opportunity to use satellite connections
between the servers, which then lead to hierarchical networks in which the
individual islands are updated from a central server system [?].

Fig. 2. The Architecture of a Media Server Network managed by the DSMS



For the clients, access to the media streams stored on the whole server
network is provided by a web application that connects to the server
management system in order to locate media streams in the server network. If the
requested media asset is not available from the local server, the DSMS
determines the estimated time for the transmission to the client's local server.
The client can then choose whether or not to 'order' the transmission of the
media stream to the local server. If the requested stream is available in
different qualities, the DSMS provides a list of the available streams together
with their quality (in this case, bit rate and encoding format were considered
sufficient to decide which stream meets the user's requirements), from which the
client can choose. In some cases, a lower-quality stream which can be provided
within a short amount of time might be sufficient, whereas in other cases the
stream with the highest available quality is requested, e.g. for a presentation.

¹ The usage of IntServ was not supported by the network hardware in this environment.

Missing the estimated deadline for the arrival of a media stream can have
severe consequences, e.g. if the scheduled start of an important presentation
in which a media stream is required cannot be met. The importance of meeting
the deadlines is a major motivation for using the hardware support for
different service classes on the network not only for real-time streaming but
also for the transmission of media files.

The media streams can be viewed at the end systems using web browsers
with plug-ins capable of decoding and displaying the media assets. In the
scenario considered in this document, the communication between clients and
servers is based on the RTSP and RTP protocols. The general architecture of the
business TV application is depicted in Figure 2, which shows a simplified model
of the components of the DSMS and how they are integrated into the application
environment. Some of the components are located on each server, whereas the
core components run as a central service for a larger domain, i.e. a larger set
of islands, which is described in more detail in Section 3.2.



3 The DSMS 

In this section, the functionality and architecture of the DSMS are described.



3.1 Functionality 

The DSMS organizes a server network in a decentralized way. Media assets can
be installed on and removed from each of the servers in the network. Once a new
asset is installed on a single server, the DSMS 'publishes' this asset to the
rest of the network, thus providing access for each client. After that, the new
media asset can be distributed in the server network.

Besides dealing with multiple copies of the same media asset, which result
from the requirement to distribute media assets, the DSMS allows a single media
asset to be present on the servers of the network with different bit rates or
encoding formats. In order to distinguish between the content and the physical
copies, each media asset is addressed using a logical and a physical address.
The logical address describes a collection of assets with the same content but
different properties. The physical address uniquely defines a physical media
stream located on one of the servers. The DSMS automatically creates the
physical addresses during the storage process and adapts them during
migration. Each client only uses the logical address to request a media asset
without having to know the physical locations of the different copies.
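
The separation of logical and physical addresses can be illustrated with a
minimal sketch. The paper does not give the address format or the database
schema, so the class names, fields, and example values below are assumptions
made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PhysicalCopy:
    """One stored stream; (server, path) plays the role of the physical address."""
    server: str
    path: str
    bit_rate_kbps: int
    encoding: str


@dataclass
class Catalog:
    """Maps a logical address to all physical copies with the same content."""
    copies: dict = field(default_factory=dict)

    def store(self, logical: str, copy: PhysicalCopy) -> None:
        # A new physical address is registered when a copy is stored.
        self.copies.setdefault(logical, []).append(copy)

    def migrate(self, logical: str, copy: PhysicalCopy, new_server: str) -> PhysicalCopy:
        # Migration changes only the physical address; the logical one is stable.
        moved = PhysicalCopy(new_server, copy.path, copy.bit_rate_kbps, copy.encoding)
        entries = self.copies[logical]
        entries[entries.index(copy)] = moved
        return moved

    def resolve(self, logical: str) -> list:
        # Clients ask by logical address and get every known physical copy.
        return list(self.copies.get(logical, []))


# Hypothetical example data.
catalog = Catalog()
mpeg1 = PhysicalCopy("berlin", "/assets/news.mpg", 1500, "MPEG-1")
mpeg2 = PhysicalCopy("berlin", "/assets/news.m2v", 4000, "MPEG-2")
catalog.store("news/evening", mpeg1)
catalog.store("news/evening", mpeg2)
catalog.migrate("news/evening", mpeg2, "hamburg")
```

A client that resolves "news/evening" sees both copies, regardless of the
server on which each one currently resides.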

An important function used to improve the overall QoS for users of the
DSMS is the optimization of the placement of media assets on the network. A key
factor for the perceived QoS of the system is the minimization of asset
transmissions, i.e. the aim is to place assets of interest to users on nearby
servers, which can be achieved using the algorithms and methods described in
[?]. In order to reduce the need to dynamically remap the assets onto the
server network at run time, the DSMS allows clients to subscribe to content
categories, which are stored together with each media asset. These categories
describe collections of assets with similar content, e.g. news broadcasts. When
users subscribe to a content category, the DSMS automatically copies newly
installed media assets of that category to the local servers of the islands
where the category was subscribed. If a user requests assets of a category not
subscribed to by any user in this user's island, the DSMS copies these assets
to the client's local server.
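
The subscription behaviour described above can be sketched as follows. This is
a simplified model under stated assumptions: islands and categories are plain
strings, copying an asset is reduced to a set update, and all function names
are hypothetical rather than taken from the DSMS.

```python
from collections import defaultdict

subscriptions = defaultdict(set)   # island -> subscribed content categories
holdings = defaultdict(set)        # island -> assets on its local server


def subscribe(island, category):
    subscriptions[island].add(category)


def install(asset, category, origin):
    """Install on the origin server, then copy to every subscribed island."""
    holdings[origin].add(asset)
    targets = [i for i, cats in subscriptions.items()
               if category in cats and i != origin]
    for island in targets:
        holdings[island].add(asset)
    return sorted(targets)


def request(asset, island):
    """Serve locally if possible, otherwise copy the asset on demand."""
    if asset in holdings[island]:
        return "local"
    holdings[island].add(asset)
    return "copied"


subscribe("hamburg", "news")
targets = install("news-001", "news", origin="berlin")
```

In the real system the on-demand copy would be a scheduled transmission with
an estimated completion time rather than an immediate update.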

3.2 System Architecture 

The DSMS groups a number of islands of the server network into administrative
domains, each controlled by modules which are central instances for a single
domain. In addition, DSMS modules running on each server node are responsible
for the administration of media assets and the communication between
different servers (see Figure 3).




Fig. 3. Modules of the DSMS 



The modules running once per domain are the resource management (RM)
and a database component. The DSMS uses the RM component for the
administration of the advance reservation mechanism. It is responsible for
scheduling the migration and transmission of media files and allows the
reservation of bandwidth for streaming media assets in advance. The RM manages
the available bandwidth resources and grants or refuses access to the network
for streaming traffic. The transmission of media files is also scheduled and
initiated by this component. The logical and physical addresses are kept in the
database component of the DSMS, which also runs as a central service for a
single domain. In order to build a large-scale server network, the central
components of different domains can be connected.

The purpose of the modules running on each server node is the administration
of media assets on the respective node and the physical transmission of media
assets among different servers. The transmission is scheduled and started by
the domain's central RM module, which contacts the transmission modules on both
server nodes involved in order to initiate the transfer of an asset.

The open architecture of the DSMS allows the software to be used not only for
managing media servers but also for other network services which need to
transmit large amounts of data using different types of networks.



4 Advance Reservation for Media Streaming and File 
Transmissions 

According to the DiffServ model, service classes can be used to prioritize
different categories of network traffic. In the application environment
described in Section 2, the hardware support for different service classes
could be used to implement an advance reservation mechanism, which was designed
for, but is not limited to, the environment of the media server network.

Mapping the media assets onto the server network as mentioned in Section
3.1 cannot completely eliminate the need to transmit assets across
the network. For example, transmissions occur each time changes in user
behavior require the assets to be remapped onto the server network. If an end
user requests an asset from a remote server, the time of availability of each
copy of this particular asset must be provided to the end user. This process is
similar to booking a hotel room: once a room reservation has been made, the
guest expects the room to be available at the time of arrival. In the same way,
the request for a media asset requires the asset to be available at the end
user's local server at a given time.

Support for advance reservations of bandwidth for streaming network
traffic is an important aspect, especially for video-on-demand systems. Other
approaches use RSVP for the resource reservations, which either lacks the
capability to reserve resources in advance or requires the protocol to be
modified or extended. In contrast, the DSMS implementation prioritizes the
streaming traffic using the DiffServ model.

The approach described in this paper is to reserve a fixed amount of the
available bandwidth (corresponding to the highest priority DiffServ service
class) for streaming and non-streaming media traffic. This service class is
managed by the DSMS. Other network traffic, not related to the DSMS, is treated
as best-effort traffic (see Figure 4). This allows transmission times of media
assets to be estimated reliably, so that deadlines for the transmission of
media data can be met and the required QoS for the streaming traffic can be
provided.

Fig. 4. Usage of a Service Class for Streaming and Non-Streaming Media Traffic
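
Because the bandwidth share of the managed service class is fixed, a
transmission time can be estimated by simple arithmetic. The following sketch
uses the 4 Mbit/s MPEG-2 rate mentioned in Section 2; the 90-minute duration
and the 2 Mbit/s share available for the file transfer are assumed values
chosen only for illustration.

```python
def transfer_seconds(size_bytes: int, reserved_bit_per_s: float) -> float:
    """Estimated transfer time when the full reserved rate is available."""
    return size_bytes * 8 / reserved_bit_per_s


# Size of a 90-minute stream encoded at 4 Mbit/s (assumed duration).
size_bytes = int(4_000_000 / 8 * 90 * 60)

# Assumed share of the service class that the file transfer may use.
estimate = transfer_seconds(size_bytes, 2_000_000)
hours = estimate / 3600
```

Under these assumptions the asset occupies 2.7 GB and its transfer takes three
hours, which is the kind of estimate the DSMS would present to the client
before the transmission is ordered.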



The resource management component (RM) of the DSMS, described in Section
3.2, serves as a bandwidth broker [?] which is responsible for allocating
network resources, initiating the transmissions and keeping track of the
available bandwidth in the DiffServ class reserved for media streaming and
transmissions. Each instance of the resource management module schedules the
communication within a single administrative domain, which can contain more
than one island. Currently, the scheduler is implemented according to the
first-fit principle: each file transmission is scheduled as soon as possible.
In order to schedule transmissions across different domains, the resource
management modules can interact with each other and negotiate the actual
transmission time and available bandwidth. Due to the bandwidth limitation,
streaming traffic across domain borders is not considered in current scenarios
but can be included in future implementations.
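
The first-fit principle can be sketched as follows. The paper does not
describe the scheduler's data structures, so this example assumes a timeline
of discrete slots with a per-slot bandwidth load; the function name and all
parameters are hypothetical.

```python
def first_fit(load, capacity, bandwidth, duration, earliest=0):
    """Book the earliest start slot at which `bandwidth` fits for `duration`
    consecutive slots. `load` is a list of per-slot usage and is updated."""
    if bandwidth > capacity or duration <= 0:
        raise ValueError("request can never be scheduled")
    start = earliest
    while True:
        # Extend the timeline on demand; empty future slots always fit.
        while len(load) < start + duration:
            load.append(0)
        window = range(start, start + duration)
        if all(load[t] + bandwidth <= capacity for t in window):
            for t in window:
                load[t] += bandwidth
            return start
        start += 1


load = []
a = first_fit(load, capacity=10, bandwidth=6, duration=3)  # fits at slot 0
b = first_fit(load, capacity=10, bandwidth=6, duration=2)  # must wait for a
c = first_fit(load, capacity=10, bandwidth=4, duration=2)  # fits beside a
```

The third request starts at slot 0 next to the first one, which shows that
first-fit fills gaps left by earlier bookings instead of queueing strictly in
arrival order.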



[Figure labels: High-Priority Service Class; Streaming Traffic (Advance and
Immediate Reservations); Non-Streaming Traffic Limit; Non-Streaming Traffic;
horizontal axis: time]

Fig. 5. The Different Traffic Types in the Highest Priority Service Class



The two types of network traffic (streaming and non-streaming) are handled
as follows: advance reservations of bandwidth for streaming traffic can be made
using a corresponding interface of the RM component. Such a reservation
requires the beginning and the end of the streaming session to be specified. If
the duration is exceeded, the traffic no longer benefits from the
prioritization. Clients can make use of the advance reservation mechanism
before actually streaming media assets; however, this is not required. If a
user wishes to view a media stream from the local server and no advance
reservation was made, the RM grants or refuses access to this asset depending
on the current network load in the highest service class, thus avoiding
conflicts between different transmissions.
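
A minimal admission-control sketch for such reservations is shown below. The
interface of the RM is not specified in the paper, so the class and method
names are assumptions. An immediate request is modelled here as a reservation
that starts now, and it is also checked against already booked future
reservations; that check is a design assumption of this sketch.

```python
class ResourceManager:
    """Tracks reservations in one service class with a fixed capacity."""

    def __init__(self, capacity_kbps):
        self.capacity = capacity_kbps
        self.reservations = []  # (start, end, rate_kbps), end is exclusive

    def peak_load(self, start, end):
        # Load rises only at reservation start times, so the peak within
        # [start, end) occurs at `start` or at a start time inside it.
        points = {start} | {s for s, e, _ in self.reservations if start <= s < end}
        return max(
            sum(r for s, e, r in self.reservations if s <= t < e)
            for t in points
        )

    def reserve(self, start, end, rate_kbps):
        """Grant the reservation only if capacity holds for the whole interval."""
        if self.peak_load(start, end) + rate_kbps > self.capacity:
            return False
        self.reservations.append((start, end, rate_kbps))
        return True


rm = ResourceManager(capacity_kbps=8000)
first = rm.reserve(100, 200, 4000)      # advance reservation
second = rm.reserve(150, 250, 4000)     # overlaps, class is now full
refused = rm.reserve(180, 190, 1500)    # no room while both are active
later = rm.reserve(200, 260, 4000)      # first one has ended by then
```

Traffic that continues past its reserved end time is simply no longer covered
by any entry, which corresponds to losing the prioritization.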

In the scenario described in Section 2, it is sufficient to schedule the file
transfers not initiated by client requests (i.e. transfers required due to the
remapping of media assets or resulting from subscriptions) during night time,
when the traffic in the company's network is low. When client requests require
the transmission of media files during the day, only a small amount of the
available bandwidth in the high-priority service class is used for those
transmissions in order to ensure sufficient network resources for streaming
traffic. The amount of bandwidth used for file transfers can be statically
configured and adapted to the requirements of the application (see Figure 5).

In the current implementation, for administrative reasons, both streaming
and transmission processes share the same service class. Since the transmission
of media files does not impose severe constraints such as latency on the
underlying network, a lower-priority service class could be used for this type
of traffic. However, using best-effort service for those transmissions is not
sufficient, since the application (i.e. the users) requires the estimated
deadlines to be met, and these estimates must be tightly calculated in order to
achieve an efficient network utilization.

Such functionality, based upon the hardware support of the DiffServ-enabled
corporate network, has the advantage of easy administration and, in contrast
to other efforts to implement advance reservations on networks, does not
require changes to existing protocols or protocol stacks.

The access to the highest service class is restricted using the router’s built-in 
traffic filtering mechanisms. This assures that only traffic from one of the media 
servers benefits from the prioritization mechanism. 



5 Conclusion and Future Work 

In this document, a management system for large-scale media server networks 
is presented. An important feature of the management software is the provision 
of QoS for both streaming and non-streaming traffic, i.e. transmissions of large 
media assets. 

The DSMS automates the distribution of the media streams within
the server network, thus keeping the administrative effort low. Media assets can
be stored on each of the servers in the network, and the distribution of assets
is handled by the management system. It is not necessary to transmit media
assets manually, since they are automatically copied only to the servers where
they are required. This functionality is supported by subscriptions to content
categories. If an asset is not available at a client's local server, the
management software allows a user to choose among all the physical copies of
this particular asset and to select the stream with the most appropriate
properties, such as the bit rate or the time required for the transmission to
the local server.

In order to use the available network resources efficiently and to provide
sufficient bandwidth for streaming and non-streaming traffic within the server
network, DiffServ-enabled hardware can be used. This allows the non-streaming
network traffic to be scheduled in a way that provides an efficient network
utilization while meeting the estimated deadlines. In addition, immediate
and advance reservations for streaming traffic can be made, providing the
required QoS for the video-on-demand service. In contrast to existing
approaches for establishing advance reservation mechanisms, our solution can
be applied without changes to existing protocols, software or hardware.

In future versions of the DSMS, more sophisticated scheduling algorithms for
the file transfers will be implemented, e.g. media files could be split and
transferred in several pieces in order to utilize the available network
resources more efficiently. Moreover, the usage of alternative routes could be
implemented in order to avoid bottlenecks on certain network links.

Currently, the DSMS is used in a business TV application. Other future
applications of the DSMS are TV Anytime systems (i.e. media server networks
that work like digital VCRs, providing time-independent on-demand access to
TV programmes), where the mapping of media assets onto the server
network, in particular the decision which programmes to record, plays a more
important role due to the enormous amount of media assets that can
theoretically be recorded.

Although specially designed to support a media server network, the
generalization of the architecture described in this document seems to be a
promising approach to implementing advance reservations for any type of network
traffic. In such an environment, aspects such as billing and brokerage, which
were not in the focus of this particular implementation, will be introduced to
the system. Among the features to examine and implement will be a dynamic
pricing algorithm that allows bandwidth prices to be adjusted to the demand and
vice versa, i.e. the transmission of media files will only be carried out if
the gain from transferring an asset is higher than the price of the required
bandwidth.

Abstract. The CORBA Event Service specification lacks important features in
terms of quality of service (QoS) characteristics required by multimedia 
information. The main objective of the work described in this paper is to 
augment the standard CORBA Event Service specification with a set of 
extensions, to turn it into an adaptable QoS middleware multimedia framework. 
To meet this, some extensions to the CORBA Event Service already developed 
with the aim of providing multicasting and reliability features have been 
enhanced in order to allow the close interaction with multicast transport 
protocols and with QoS monitoring mechanisms. The result was a QoS-aware 
middleware platform that actively adapts the quality of service required by the 
applications to the one that is provided by the underlying communication 
channel. The main quality of service features addressed by the platform - and 
discussed in the paper - are the support of sessions with different reliability 
levels, the provision of congestion control mechanisms and the capability to 
suppress jitter. 



1. Introduction 

A continuous distributed interactive medium is a medium that can change its state in
response to user operations as well as to the passage of time [1]. A broad variety of
applications use this kind of media, such as multi-user virtual reality (VR), distributed 
simulations, networked games and computer-supported co-operative work (CSCW) 
applications. Systems dealing with multimedia events combine aspects of distributed 
real-time computing with the need for low latency, reduced jitter, high throughput, 
multi-sender/multi-receiver communications over wide area systems and different 
levels of reliability [2]. These event-driven systems also require efficient and scalable
communications components. 

The work described in this paper consists of a middleware platform for
continuous distributed interactive media applications based on some extensions to
the CORBA Event Service. The platform - named Augmented Reliable Multicast
CORBA Event Service (ARMS) - provides an end-to-end communication framework
with QoS-matching capabilities, with VR application requirements in mind. VR
applications have specific features concerning interactivity, reliability,
continuity, coherence and strict time constraints that make it difficult to
characterise their traffic in terms of quality of service (QoS) requirements.

ARMS is focused on VR environments based on VRML models, which deal with 
two different types of information: events and states. Events are time critical and 
described by small amounts of information, as opposed to states that are generally 
non-time-critical and require large amounts of information for their description. 
Therefore, there is the need for different levels of reliability when exchanging event 
and state information: minimal reliability (with loss detection) for events, and full 
reliability for states. In addition, ARMS assumes that there is a common time 
reference, which requires the clocks of all participants to be synchronised by means of 
NTP or GPS clocks. 

In continuous distributed interactive media there are operations that need to be 
executed at a specific point in time and in a correct order for consistency to be 
achieved. [1] investigates this problem and proposes a solution by deliberately 
increasing the response time in order to decrease the number of short-term 
inconsistencies, leading to the concept of local lag. Instead of immediately executing 
an operation issued by a local user, the operation is delayed for a certain amount of 
time before it is executed. The determination of this value is a typical issue of 
application adaptability. 
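
The local lag concept can be sketched as a queue that releases each operation
only after a fixed delay. This is an illustrative sketch, not the mechanism of
[1] or of ARMS: the class name, the millisecond time base, and the fixed lag
value are all assumptions.

```python
import heapq


class LocalLagQueue:
    """Delays every operation by `lag` before it may be executed."""

    def __init__(self, lag):
        self.lag = lag
        self._heap = []
        self._n = 0  # tie-breaker keeps issue order for equal execution times

    def issue(self, op, issued_at):
        # Schedule execution at issue time plus the local lag.
        heapq.heappush(self._heap, (issued_at + self.lag, self._n, op))
        self._n += 1

    def due(self, now):
        """Return, in execution order, all operations whose time has come."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


queue = LocalLagQueue(lag=100)      # times in milliseconds (assumed)
queue.issue("move", issued_at=1000)
queue.issue("turn", issued_at=1020)
early = queue.due(now=1050)
first = queue.due(now=1100)
second = queue.due(now=1200)
```

With synchronised clocks, every participant that applies the same lag executes
an operation at the same instant, provided the operation reaches it within the
lag period; choosing that value is the adaptability issue mentioned above.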

VR applications are associated with three important functions: scalability, 
interaction and consistency [2]. The QoS characteristics that influence the previous 
functions are reliability, losses and delay jitter. Additionally, those functions are 
influenced by application adaptability factors like frequency of events, 
synchronisation delay, number of participants, consistency and playout time (display 
frequency) [2]. 

The ARMS platform addresses the above-mentioned QoS issues and explores the 
adaptation between the quality of service required by applications and the one that is 
provided by the underlying communication system. This paper presents the main 
approaches taken by ARMS in order to provide QoS adaptability. Section 2 provides 
a general description of the ARMS architecture. As the work presented in this paper 
corresponds to an enhancement of a previous, non-QoS-adaptive platform, section 3 
presents the characteristics of this previous platform. Section 4 provides details 
concerning the approach taken by ARMS in terms of reliability, congestion control 
and jitter. Section 5 describes additional features of the ARMS platform, namely the 
IIOP/IP multicasting gateway service and the federation of event channels. Section 6 
identifies related work. The conclusions and guidelines for further work are presented 
in section 7. 



2. ARMS General Characteristics 

Portability over heterogeneous environments is a typical requirement of distributed 
multimedia systems. Although this portability could be provided by middleware such 
as CORBA, there exists a widespread belief in the virtual reality community that the 
quality of service offered by CORBA is not suitable for next-generation large 
distributed interactive media [3]. Another obvious deficiency of CORBA is its
lack of support for interaction styles other than request/reply [4]. However, due to
their nature, the CORBA Event Service [5] and the CORBA Notification Service [6]
have some potential to be used for continuous distributed interactive media applications.

Previous research has been focused on limitations of the CORBA Event Service, 
namely multicasting, reliability and bulk data handling [7,8]. The work presented in 
this paper extends it to support adaptive QoS middleware functionalities. 

ARMS offers a set of QoS-related mechanisms for reliability guarantee, congestion 
control and jitter control. The QoS management process is supported by object-based 
monitoring and adaptation functions (Figure 1). Monitoring is the process of 
observing the utilisation of resources and/or QoS characteristics in the system. ARMS 
has specific objects for loss and jitter monitoring. 




Fig. 1. ARMS architecture



Adaptation mechanisms generally rely on resource control, reconfiguration or change 
of service [9]. ARMS uses a resource control paradigm, providing adaptation 
mechanisms for network congestion, losses and jitter. Placing adaptation capabilities 
in the middleware gives applications the ability to concentrate on specific 
functionalities, to enforce different adaptation policies and to interact with other 
components in the system in order to ensure fairness and other global properties [10].

3. Synopsis of ARMS Version 1

The first generation of ARMS offered a set of extensions to the CORBA Event 
Service specification [7], namely mechanisms for IP multicast communication, 
reliability and fragmenting/reassembling of large events (Figure 2). 




This service is a pure Java implementation (JDK 1.2 compatible) [11], and so it
benefits from the strengths of Java, including portability, security and
robustness. It is also ORB-agnostic: it is written to the standard IDL-to-Java
mapping, so it should work with any Java ORB that supports the standard mapping.
As in the original OMG CORBA Event Service specification, suppliers and
consumers are decoupled, that is, they do not know each other's identities; thus, the
standard Event Service may still be provided by this extended service.

The work is based on the approach known as the Push model. The objective is
to have the event data sent to the consumer as soon as it is
produced. The canonical Push model allows suppliers of events to initiate the transfer
of event data to consumers.
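
The push style of interaction can be illustrated with a minimal sketch. It is
deliberately not the OMG interface: the proxy and admin objects of the real
Event Service are omitted, and the class and method names are illustrative
assumptions.

```python
class EventChannel:
    """Decouples suppliers from consumers and forwards pushed events."""

    def __init__(self):
        self._consumers = []

    def connect_consumer(self, consumer):
        self._consumers.append(consumer)

    def push(self, event):
        # The supplier initiates the transfer; the channel fans it out.
        for consumer in self._consumers:
            consumer.push(event)


class Recorder:
    """A consumer that stores every event it is sent."""

    def __init__(self):
        self.received = []

    def push(self, event):
        self.received.append(event)


channel = EventChannel()
a, b = Recorder(), Recorder()
channel.connect_consumer(a)
channel.connect_consumer(b)
channel.push("position-update")
```

The supplier only holds a reference to the channel, so it never learns the
identities of the consumers, which mirrors the decoupling described above.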

The reliable multicast extension can be seen as an alternative way to get the event
across to the consumer. This assumption forces the new service to keep all the
standard interfaces with the same functionality (the same methods) that are defined by
OMG, allowing a choice to be made by the supplier/consumer between IIOP and
Reliable IP Multicast. The service implements two kinds of IP multicast interfaces: IP
Multicast-Any and IP Multicast-Streams. IP Multicast-Any deals with Any
values, while IP Multicast-Streams deals directly with byte-stream values, avoiding
the overhead caused by marshalling and de-marshalling of the proprietary Any. The
reliable multicast solution is based on the Light-weight Reliable Multicast Protocol
(LRMP) [12], which deals with IP Multicasting and provides the necessary reliability
and better scalability (Figure 3).

Fig. 3. ARMSv1 architecture



4. ARMS Version 2 

The quality of service provided to multimedia sessions is determined, in general, by 
packet losses and delay. The second generation of ARMS acts as a transport level 
with QoS monitors that provide information on key aspects of QoS provision such as 
losses and jitter. Adaptation mechanisms for network congestion (congestion control), 
losses (reliability sessions) and jitter (jitter filter) have an important influence on the 
performance of virtual environment applications. 



4.1 Reliability Sessions 

ARMSv2 supports several reliability levels: Reliable, Reliable-Limited Loss, Loss 
Allowed, Unreliable with loss notification and Unreliable. 

The first is a typical strong reliability session. Reliable-Limited Loss is a session 
where limited losses are permitted for some types of packets and loss notifications 
will be triggered when this happens. It provides guarantees for congestion control at 
sender side and sequence control at the receiver side. Loss Allowed is a session where 
losses are allowed and accepted for all types of packets but provides loss notification, 
congestion control at sender side and sequence control at the receiver side. Unreliable 
with loss notification is an unreliable session, where data are not subject to congestion 
control at the sender side. However, the ARMS upper level maintains a queue for 
sequence number control, which allows loss notification. Lastly, Unreliable is a 
purely unreliable session. Table 1 summarises the characteristics of the various types 
of sessions offered by ARMSv2. 

Table 1. Types of ARMSv2 sessions

Session type                       Losses                    Loss          Congestion  Sequence
                                                             notification  control     control
Reliable                           No                        Yes           Yes         ?
Reliable-Limited Loss              Yes, for some data types  Yes           Yes         Yes
Loss Allowed                       Yes                       Yes           Yes         Yes
Unreliable with loss notification  Yes                       Yes           No          Yes
Unreliable                         Yes                       No            No          No



4.2 Congestion Control 

Currently, different approaches are being discussed to solve QoS and congestion 
control problems [13]. Complex distributed applications residing in heterogeneous 
end-to-end computing environments must be flexible and adapt to QoS variations in 
their end-to-end execution [10]. That is, applications must adapt the bandwidth share
they are using to the network congestion state [13]. Usually, there are two distinct 
levels at which adaptation may take place - the system level (e.g. operating systems 
and network protocols) and the application level - with different objectives. In order 
to balance the objectives of these approaches, the ARMS middleware closely interacts 
both with application needs and with multicast protocols, monitoring network 
parameters and operating systems resources. 

In terms of network QoS control mechanisms, ARMS directly monitors the reliable 
multicast communication protocol, LRMP, adapting the sender transmission rate to 
the network congestion state. The adaptation is based on information carried by 
NACK packets and on local congestion information [8], such as the sender window size
[12]. Based on congestion information gathered from lower communication objects,
ARMS and applications adjust the upper-level sending rate. 

QoS mechanisms that are based on adapting the sender transmission rate to the
network congestion state do not work well in large multicast groups and heterogeneous
environments, because poorly performing receivers would impose a low transmission
quality. To avoid this, several proposals have been made for hierarchical data
distribution [14,15,16]. Nevertheless, in virtual reality environments, data layering
approaches are not appropriate for most data, especially for time-critical data such as
VRML events. Delay is the most important QoS factor for this type of data. Layered
data mechanisms solve heterogeneity problems but cause additional delay at the
receivers [13]. Thus, to avoid this, ARMS adapts the minimum rate to ensure the
fairness of the adaptive congestion control. Receivers should leave the session when
the loss rate is very high and the data rate is not reduced by the sender. Nevertheless,
layered multicast [14] can be useful and will be explored in subsequent stages of the
work.
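
The general idea of NACK-driven rate adaptation with a lower rate bound can be
sketched as follows. The paper does not give the adaptation rule used by ARMS
or LRMP, so the additive-increase, multiplicative-decrease policy, the
constants, and the loss limit below are assumptions chosen only to make the
idea concrete.

```python
def adapt_rate(rate, nacks, min_rate, max_rate, step=10.0, backoff=0.5):
    """Return the next sending rate (assumed unit: kbit/s).

    Assumed policy: back off multiplicatively when NACKs were received,
    otherwise probe upwards additively; never leave [min_rate, max_rate].
    """
    if nacks > 0:
        rate *= backoff
    else:
        rate += step
    return max(min_rate, min(max_rate, rate))


def should_leave(loss_rate, rate_reduced, loss_limit=0.2):
    """Receiver-side rule: leave if losses stay high and the sender
    does not reduce its rate. The 20% limit is an assumed value."""
    return loss_rate > loss_limit and not rate_reduced


congested = adapt_rate(100.0, nacks=3, min_rate=64.0, max_rate=500.0)
clear = adapt_rate(100.0, nacks=0, min_rate=64.0, max_rate=500.0)
capped = adapt_rate(495.0, nacks=0, min_rate=64.0, max_rate=500.0)
```

The lower bound keeps a single slow receiver from driving the rate of the
whole group towards zero; such a receiver is instead expected to leave.
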

4.3 Jitter

Variance in end-to-end delay is called delay jitter or, simply, jitter. Critical
information such as audio, video and continuous distributed interactive
media (distributed simulations, multi-user virtual reality) should be played back
continuously, which means that there must be some form of jitter compensation.




[Figure labels: network delay; ARMS delay]

Fig. 4. Components of end-to-end delay


In addition to components for error control, ARMS provides mechanisms for
monitoring and controlling jitter (Figures 4 and 5), so that the original temporal
relationships can be recovered. The queue of the lower-level reliable multicast protocol
sends ordered packets (reliable data) to an upper-level Jitter Filter Queue, which
absorbs the delay variations exhibited by arriving packets. Compensation is done by
introducing an additional, variable delay before packets are delivered to the
application level, as opposed to compensation performed at the application level [17].

ARMS allows applications to contract this jitter compensation (threshold
synchronisation) for certain types of data, namely unreliable data (pure or with loss
notification) and reliable data with limited loss. Packets that arrive after a given 
threshold are considered to be too late. These can simply be dropped or, alternatively, 
be marked as late packets and passed to the application. Applications can ignore these 
packets or can react by requesting that a special NACK be sent by ARMS. 
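
The late-packet handling described above can be sketched as follows. This is a minimal illustration, not the ARMS implementation: the class JitterFilterQueue, its methods and the string payload are hypothetical, the playout deadline is assumed to be the source timestamp plus a delay bound D, and source and sink clocks are assumed to be comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical jitter filter: holds packets until sourceTimestamp + delayBound. */
class JitterFilterQueue {
    static final class Packet {
        final long sourceTimestampMs;
        final String payload;
        boolean late;   // set when the packet arrived after its playout deadline

        Packet(long sourceTimestampMs, String payload) {
            this.sourceTimestampMs = sourceTimestampMs;
            this.payload = payload;
        }
    }

    private final long delayBoundMs;   // D, the threshold contracted by the application
    private final boolean dropLate;    // true: drop late packets; false: mark and deliver them
    private final PriorityQueue<Packet> queue =
        new PriorityQueue<>(Comparator.comparingLong((Packet p) -> p.sourceTimestampMs));

    JitterFilterQueue(long delayBoundMs, boolean dropLate) {
        this.delayBoundMs = delayBoundMs;
        this.dropLate = dropLate;
    }

    /** Queues a packet; returns false if it was late and the policy is to drop it. */
    boolean offer(Packet p, long arrivalMs) {
        if (arrivalMs > p.sourceTimestampMs + delayBoundMs) {
            if (dropLate) {
                return false;
            }
            p.late = true;   // the application may ignore it or ask for a special NACK
        }
        queue.add(p);
        return true;
    }

    /** Removes and returns, in timestamp order, every packet whose playout time has come. */
    List<Packet> release(long nowMs) {
        List<Packet> due = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().sourceTimestampMs + delayBoundMs <= nowMs) {
            due.add(queue.poll());
        }
        return due;
    }
}
```

Marking rather than dropping keeps the decision with the application, which matches the two options the text gives for packets that miss the threshold.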



Jitter Algorithms 

ARMS implements the Jitter Filter with two different algorithms. The choice of which
algorithm to use is made at configuration time. The first algorithm is described in the 
following paragraphs. 

To recover the original timing properties, the Jitter Filter buffers (Figure 5) the 
packets at the sink until time T + D, where T is a source timestamp and D is the 
bounded maximum end-to-end delay. When networks are unable to guarantee a 
maximum end-to-end delay bound, the receiver continuously updates an estimate of 
the maximum delay in order to calculate the buffering time. One of the algorithms that
can be used to calculate D is based on the RTP specification [18], which estimates the
statistical variance of the RTP data packet inter-arrival time, measured in timestamp units
and expressed as an unsigned integer. At a given instant, D is the maximum of all jitter
values calculated up to that instant according to the formula:



J_i = J_{i-1} + (|Dif(i-1, i)| - J_{i-1}) / 16

where the inter-arrival jitter J is defined to be the mean deviation of the difference Dif
in packet spacing at the receiver compared to the sender for a pair of packets [18]. 
This is equivalent to the difference in the “relative transit time” for the two packets. 
So, Dif may be calculated as 

Dif(i, j) = (R_j - R_i) - (S_j - S_i) = (R_j - S_j) - (R_i - S_i)

S_i is the reliable multicast protocol timestamp for packet i and R_i is the time of arrival
in reliable multicast protocol timestamp units for packet i; likewise for packet j. Jitter
is calculated continuously as each packet is received, for each source. Factor 1/16 was 
chosen to reduce measurement noise while converging reasonably quickly [18,19]. 
The code below implements the algorithm, where the estimated jitter can be kept as 
an integer. 

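// transit and jitter are fields: the previous transit time and the running estimate.
protected void updateJitter(int timestamp) {

    // Relative transit time of this packet, converted to milliseconds.
    int elapsed = NTP.ntp32(lastTimeForData - timestamp);
    elapsed = NTP.fixedPoint32ToMillis(elapsed);
    int d;

    // Dif(i-1, i): change in transit time since the previous packet.
    if (transit != 0)
        d = elapsed - transit;
    else
        d = 0;

    transit = elapsed;

    if (d < 0)
        d = -d;

    // Integer form of J += (|d| - J)/16, with jitter storing 16*J;
    // adding 8 rounds the shifted term.
    jitter += d - ((jitter + 8) >> 4);
}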

The second jitter algorithm is based on the work of [20], where statistical analysis of 
per-packet delay is used to estimate the maximum delay: 

D = d + r * s 

where d is the average delay, s is the standard deviation and r is a filter coefficient
[20]. The algorithm continuously estimates the average delay and standard deviation,
and is based on the low-pass filtering algorithm used in TCP for the estimation of
the acknowledgement delay time [20]. For each packet, the transmission delay, t,
is calculated as the difference between the reception time and the emission timestamp.
The average delay and standard deviation are then calculated as d = d_old + a(t - d_old)
and s = s_old + b(|t - d| - s_old), respectively. The constants a and b (a, b < 1) are
smoothing coefficients, with the typical values 1/8 and 1/16, respectively.
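
The second algorithm can be sketched in a few lines. This is an illustration under stated assumptions rather than the ARMS code: the class DelayEstimator and its method names are hypothetical, the first sample is assumed to initialise the average with zero deviation, and the deviation update is assumed to use the freshly updated average.

```java
/** Hypothetical estimator for D = d + r * s using low-pass filtered mean and deviation. */
class DelayEstimator {
    private final double a;   // smoothing coefficient for the average delay
    private final double b;   // smoothing coefficient for the deviation
    private final double r;   // filter coefficient applied to the deviation
    private double mean;      // d
    private double dev;       // s
    private boolean primed;

    DelayEstimator(double a, double b, double r) {
        this.a = a;
        this.b = b;
        this.r = r;
    }

    /** Feeds one per-packet delay t = reception time - emission timestamp. */
    void onPacket(double t) {
        if (!primed) {            // assumption: the first sample seeds the average
            mean = t;
            dev = 0.0;
            primed = true;
            return;
        }
        mean += a * (t - mean);                 // d = d_old + a (t - d_old)
        dev += b * (Math.abs(t - mean) - dev);  // s = s_old + b (|t - d| - s_old)
    }

    /** Current estimate of the maximum delay D. */
    double maxDelay() {
        return mean + r * dev;
    }
}
```

With a = 1/8, b = 1/16 and an illustrative r = 4, delays of 100 ms and then 108 ms give d = 101, s = 0.4375 and D = 102.75 ms. The value of r is an assumption here; the text leaves it to [20].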

5. Additional Features of the ARMS Platform 

In addition to providing QoS adaptation, the ARMS platform has some optimisation 
features that contribute to its transparency and scalability. With regard to 
transparency, the platform provides a gateway service between IIOP and IP
Multicasting. Scalability is supported by the federation of Event Channels. 



5.1 Event Channel IIOP/IP Multicasting Gateway Service 

ARMS guarantees interoperability between event suppliers and consumers in a way
that is independent of the communication facilities used: standard IIOP or IP
multicast. To achieve this, the Event Channel provides a transparent IIOP gateway.
The interfaces that are provided by OMG must be
maintained by the proposed service in order to deal with the Any type of data. In this
way, interoperability between the proposed service and most of the commercially
available ORBs can be achieved. In order to deal with all the values supported by the
Any type, a new object, Any, was created as an extension of the OMG class, to
override the ORB's implementation.

The values to be passed to this service, in order to be sent via multicast, must be
created with the proprietary interfaces of the service, which provide a set of new
methods to deal with the resulting buffers and fill the LRMP data packet with the data
contained in those buffers. A method extracts the sequence of bytes generated by the
Any marshalling operation; the result is a stream of bytes that is opaque to LRMP.
Once the bytes reach the consumer LRMP object, it reassembles them to form an Any
and delivers it to the consumer as a valid Any, whose value is then extracted through
the application interfaces.

To provide the interchange of Multicast events and Standard events, two scenarios 
are considered: 

• when the supplier is a MulticastPushSupplier and the consumer is a 
normal Consumer; 

• when the supplier is a normal Supplier and the consumer is a
MulticastPushConsumer. 

For the first scenario, and knowing that the model to be dealt with is the Push 
Model, the event is sent via LRMP, i.e. is sent to a well known multicast group, and is 
pushed to Multicast Consumers by one LRMP object that receives events directly. As 
the Event Channel holds one ProxyMulticastPushConsumer for all Suppliers, this 
proxy acts as a normal consumer to which events are pushed. Each time this proxy is 
pushed a new event, it invokes the receive method on the Event Channel. This
invokes a receive method on the ConsumerAdmin, which holds references to all
standard ProxyPushSuppliers and invokes the receive method on each of those
proxies, which then invoke the push method on the consumer they are attached to.
The ConsumerAdmin does nothing on the ProxyMulticastPushSupplier, or else events 
would be sent twice to the multicast group. 

For the second scenario, the supplier calls the push method on its
ProxyPushConsumer, which calls receive on the Event Channel to deliver the event to
standard consumers, and also asks for any existing ProxyMulticastPushSuppliers.
If there are any, the ProxyPushConsumer calls the receive method on that proxy. In this way, the
event is sent to the multicast group, and all MulticastPushConsumers receive it. Thus,
the Event Channel is responsible for doing all the necessary IIOP gateway work.

In either scenario, the gateway operation implies changes to the Any Object that is 
being forwarded. The Event Channel also takes care of this detail, at the cost of an 
additional overhead caused by the de-marshalling of the proprietary Any and 
subsequent marshalling as an ORB Any, and vice-versa. 
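
The forwarding rule of the two scenarios can be condensed into a small sketch. It is a simplification under stated assumptions: GatewayEventChannel and its methods are hypothetical names, events are plain strings instead of CORBA Any values, and the proxies and the ConsumerAdmin are collapsed into callbacks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical gateway logic: who receives an event depends on where it came from. */
class GatewayEventChannel {
    private final List<Consumer<String>> standardProxies = new ArrayList<>();
    private Consumer<String> multicastProxy;   // null when no multicast supplier proxy exists

    void addStandardConsumer(Consumer<String> proxy) {
        standardProxies.add(proxy);
    }

    void setMulticastSupplier(Consumer<String> proxy) {
        multicastProxy = proxy;
    }

    /** First scenario: the event arrived from the multicast group. */
    void receiveFromMulticast(String event) {
        // Only standard consumers are served; re-sending to the group would duplicate the event.
        for (Consumer<String> proxy : standardProxies) {
            proxy.accept(event);
        }
    }

    /** Second scenario: the event arrived from a standard supplier. */
    void receiveFromStandard(String event) {
        for (Consumer<String> proxy : standardProxies) {
            proxy.accept(event);
        }
        if (multicastProxy != null) {
            multicastProxy.accept(event);   // forwards the event to the multicast group
        }
    }
}
```

Keeping the two entry points separate makes the "no second send to the multicast group" rule explicit instead of relying on the caller to filter proxies.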



5.2 Federation of Event Channels 

Distributed transparency of the Event Channel can lead to a less effective 
configuration. There are scenarios where consumers and suppliers reside in the same 
process, host or network and the Event Channel is remote. In these cases, there is a 
waste of network resources and an unnecessary increase in latency. The ARMS Event
Channel object uses a configuration facility to federate Event Channels, allowing an
Event Channel to be a consumer of another, remote Event Channel. Federation of 
Event Channels leads to the conservation of bandwidth because only a single event 
will be sent to all the remote users. Additionally, the average latency is reduced 
because part of the traffic becomes local. Figure 6 illustrates the use of the federation 
of event channels when communication is made via standard IIOP. In Figure 7
federation of event channels is used in conjunction with the IIOP/IP Multicast 
Gateway Service. 

The combination of IIOP/IP Multicasting Gateway Service with Event Channel 
federation contributes to the enhancement of the QoS characteristics of the platform, 
namely in terms of transparency, latency and scalability. 
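
The bandwidth argument for federation can be made concrete with a small sketch. The class FederatedChannel and its methods are hypothetical and events are plain strings; the only point illustrated is that a local channel registered as a consumer of a remote channel receives each event once and fans it out locally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical event channel that can act as a consumer of another channel. */
class FederatedChannel {
    private final List<Consumer<String>> consumers = new ArrayList<>();
    private int eventsReceived;   // number of events pushed into this channel

    void connect(Consumer<String> consumer) {
        consumers.add(consumer);
    }

    /** Registers this channel as a single consumer of the remote channel. */
    void federateWith(FederatedChannel remote) {
        remote.connect(this::push);
    }

    void push(String event) {
        eventsReceived++;
        for (Consumer<String> consumer : consumers) {
            consumer.accept(event);
        }
    }

    int eventsReceived() {
        return eventsReceived;
    }
}
```

If three consumers attach to the local channel, one event pushed into the remote channel crosses the link once (eventsReceived() is 1 on the local channel) and is delivered three times locally.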




Fig. 6. Federated Event Channel configuration: standard IIOP




Fig. 7. Federated Event Channel configuration with IIOP/IP Multicasting Gateway Service 



6. Related Work 

Several approaches exist that try to explore QoS issues in distributed object 
environments. 

OrbixTalk [21,22], a commercial IONA implementation, written in C++, of the CORBA
Event Service specification standardised by OMG [5], uses IP
multicasting and a reliability mechanism based on negative acknowledgements, in
order to provide delivery confirmation of each information object. Because the
CORBA Event Service specification does not address issues for real-time 
applications, the QoS behaviour is not acceptable for many application domains. 

OMG Notification Service [6] is a superset of the CORBA Event Service 
specification. It adds some interfaces, which deal with filtering, security, federation 
and QoS. However, this specification does not address implementation issues. 

TAO’s Real Time Events Service [3] is a powerful implementation of the CORBA
Event Service and Notification Service specifications in C++. Nevertheless, it lacks
some QoS characteristics, such as reliability. The OMG CORBA Messaging
specification [23], however, defines several levels of reliability for one-way calls, which are
being added to the TAO features to complement this service. The TAO group has
done considerable work on real-time extensions to CORBA, which enable
end-to-end QoS specification and enforcement.

Other CORBA projects, such as QuO [24], implement models for distributed
object applications that define, control, monitor and adapt to changes in QoS
parameters. The QuO project proposes various extensions to standard CORBA
components and services, in order to support adaptation, delegation and renegotiation
services that shield applications from QoS variations. Its development focuses strongly on
method invocation on remote objects.

ARMS is complementary to the above-mentioned approaches, since it is based on 
the integration of CORBA middleware and underlying services - such as multicasting 
- while the approaches mentioned above concentrate on the CORBA object model.



7. Conclusions and Guidelines for Further Work 

Quality of service requirements of continuous distributed interactive media
applications encompass several aspects that are not readily available in standard
distributed object platforms. The need for different levels of reliability, congestion 
control mechanisms and jitter suppression strategies is apparent in applications such 
as virtual reality, CSCW and distributed interactive simulations. 

Middleware platforms can play an important role in quality of service provision, 
offering flexible mechanisms that adapt the quality of service provided by the 
underlying communication channel to the quality of service required by applications, 
according to an established service contract. 

This paper presented the main implementation options of the ARMS middleware 
platform, which builds on a set of extensions to the CORBA Event Service, providing 
native multicast communication, various reliability levels, congestion control and 
jitter suppression, with the aim to achieve QoS adaptability. The platform has been 
implemented at the Laboratory of Communications and Telematics of CISUC, where 
it is operational and is currently undergoing extensive evaluation.

The first set of tests was aimed at verifying the operational status of the
platform. The basic mechanisms for reliability, congestion control
and jitter compensation proved to be operational. Subsequent tests will try to quantify the
usefulness and effectiveness of these mechanisms. These will provide valuable
information concerning the adequacy of placing QoS adaptation mechanisms in the
middleware, as opposed to strategies that place them at application level.

Additional lines of research will explore the use of Forward Error Correction
(FEC) mechanisms in order to provide some degree of reliability to time critical 
information, and the use of event filtering based on IP multicast groups in order to 
improve scalability. This latter functionality will be developed in a centralised 
service, which will map different event types to different IP multicast groups. 

In heterogeneous environments, layered multicast can be useful. This will also be 
explored in subsequent stages of the ARMS platform, namely by dynamically 
assigning receivers to multicast groups with different QoS levels, according to the 
required quality of service. 

Abstract. The requirements for the QoS of distributed applications are
traditionally expressed in terms of network-oriented or system-oriented
parameters. In general, the users of these services are not interested in, or capable
of, specifying the QoS of their services in such technical terms. In this paper, we
propose modeling and engineering concepts for the mapping of end user QoS 
onto system and network QoS. We introduce QoS agents in structured object 
middleware that relate end-user QoS specifications to multimedia stream 
bindings. In fact, the middleware layer supports QoS classes, i.e., a set of QoS 
characteristics. The end user QoS requirements, generally a set of non- 
orthogonal specifications, must be supported using the available middleware 
QoS classes. We also describe the experimental environment that will be used 
to refine the QoS mapping mechanisms. 



1. Introduction 

Object middleware, such as CORBA, DCOM and Java RMI, is gaining rapid 
acceptance as a means to quickly and cost effectively develop a wide range of 
applications for various areas of industry. The main purpose of these middleware 
systems is to provide a software infrastructure for interacting application components. 

Interactions between software components can be divided into two categories: 
discrete or continuous. Discrete interactions are typically RPC or message-oriented 
and are generally used to invoke a computational service. Continuous interactions are 
generally used for exchanging (multi-) media data and are usually called streams. 
Middleware platforms often model a stream as a binding object [Gay, Blair]. In this
paper we focus on the support of middleware platforms to establish a user-oriented 
QoS for such binding objects. 

Requirements for QoS are traditionally expressed in terms of system-oriented or 
network-oriented parameters. In this paper we propose a slightly different approach. 
We start from user-oriented requirements for QoS and identify where and how the 
middleware can translate them to system or network-level requirements for QoS. We 
define the architectural concepts that we think are necessary for understanding and 
decomposing the complexity of stream binding establishment. We further identify 
how current object middleware systems can be extended to support media stream
binding. 

Section 2 describes a framework for QoS-aware middleware. Section 3 identifies 
the interfaces for QoS specification and section 4 describes how user-oriented QoS 
specifications can be mapped to network and system level QoS. Section 5 presents 
how our framework can be applied to an implementation of the CORBA A/V Streams 
specification. Section 6 presents our conclusions.



2. A QoS Framework for Streaming 

In this section, we propose a framework for QoS-aware middleware-based distributed 
systems that combines objects, layers and planes. We have used the framework to 
structure the QoS-related functions of a distributed system. This framework is 
currently being validated in our testbed. 

In Section 2.1, we present the object model we use. In Section 2.2 we discuss how 
the object model fits into our framework from a high level point of view. In Section 
2.3 we consider the framework in more detail, in particular in terms of its layers and 
planes. 



2.1 Object Model 

Fig. 1 shows the kind of objects that we use to structure our system [Blair]. Each 
object encapsulates state and behavior and can expose operational and streaming 
interfaces to other objects. 




[Figure labels: Flow, Stream]

Fig. 1. An object with operational and streaming interfaces.

Operational interfaces allow client objects to invoke computational services on the
object that exposes the interface (the server object). Streaming interfaces allow
objects to exchange one or more flows of continuous information (e.g. audio or
video information). Flows are unidirectional and may terminate at one or more sink 
interfaces. A stream may consist of flows that travel in opposite directions. 

Objects may further consist of several other objects at a lower level of abstraction. 
The higher level object is called a compound object in such cases. 

2.2 QoS-Aware Middleware Platform 

We propose a middleware platform that is capable of associating a QoS with an object’s
streaming interface and of controlling this QoS through the same or another object’s
computational interface. We propose to extend existing middleware concepts with a
QoS framework. We elaborate this framework in particular for middleware that 
supports multimedia applications using video and audio streams. 

The application objects, e.g., cameras, microphones, speakers, files, are endpoints
of audio and video streams. The application layer furthermore contains agent objects
that act as stand-ins for the end-user (cf. [Jmf]). Agents invoke the services of our 
middleware platform to bind endpoint objects and to control the QoS of the stream 
that the binding carries. Agent objects also control the QoS of local endpoint objects. 

A binding object allows endpoint objects in the application layer to exchange a 
stream of multimedia information. A binding object represents a composition of the 
resources that are involved in tying these application level objects together. This 
includes media processing resources like codecs, multiplexers, transcoders and 
packetizers, as well as transport-level resources such as sockets, routers and bridges. 

Example 

Consider a medical application that allows surgeons to view video clips stored in a 
database. We assume that a request to connect to the video database originates from 
the surgeon’s machine. Fig. 2 shows the object constellation once the surgeon is 
viewing a video clip. 



[Figure labels: client-side application layer, middleware platform, server-side
application layer; client machine, server machine]

Fig. 2. Binding object interconnecting two application objects.

The video server object in Fig. 2 represents the database. It produces an audio- 
video stream that flows to the player object via the binding object. The player 
represents the presentation resources on the surgeon’s machine (a display and a 
speaker in this case) and consumes the stream that the binding produces. 

The two agent objects use the operational interfaces of the player, the video server 
and the binding object to control the QoS of the streams that these objects produce. 
The client-side agent object may for instance use the binding’s operational interface 
to request the establishment of a certain QoS for the binding or to change the QoS of 
an existing binding. Similarly, the agent can call the player’s operational interface to
locally reduce the volume of the video clip’s audio part. 

The operational interfaces of the binding object have two special properties. First 
of all, their operations are application oriented. This means that the binding object 
allows the agents to deal with QoS in application-level terms. For example, the 
binding object allows the client-side agent to specify that it wants the player object to 
be bound to the video server object using a binding object with a High Definition TV
QoS level. As a result, the agent objects do not have to be able to translate this 
abstract notion of QoS into low-level QoS aspects such as bandwidth, delay, and so 
on. 

The second property is that the binding object’s operational interfaces are geared 
toward the context in which the application is used. We call this the application’s 
usage context. This notion is based on our belief that different applications require a 
different QoS depending on the domain in which they are used, for what purpose the 
application is used and by whom. For example, if a surgeon uses the system of Fig. 2 
to discuss a video clip with one of his fellow surgeons, the client-side agent object 
will generally request a higher QoS from the binding object than when the surgeon 
discusses the same clip with one of his patients. We will have more to say on the 
application-orientation of the binding and our notion of a usage context in Section 3. 



2.3 The Platform’s Internals 

The binding object of Fig. 2 may encapsulate a large number and a variety of
resources. It may furthermore need to perform complex QoS control activities. We
tackle this problem by decomposing the platform into two horizontal layers and three 
vertical planes. The vertical planes are a data transfer plane, a QoS control plane and a 
QoS management plane (cf. [Aurre]). Across these planes we distinguish between a 
middleware (software) layer and the Distributed Resource Platform (DRP) layer (i.e. 
computing and network resources). 

Data Transfer 

The data transfer plane contains the objects that are concerned with forwarding the 
data units of a multimedia stream. 

An object in the data transfer plane of the middleware layer encapsulates resources 
that perform transport-independent as well as transport dependent stream processing. 
Examples of the former are encoders, transcoders and multiplexers; an example of the 
latter is an RTP packetizer that can adapt an MPEG-1 encoded stream for 
transmission over UDP. 

The objects in the data transfer plane of the DRP encapsulate the distributed 
resources (e.g. IP routers, bridges, etc.) that provide end-to-end connectivity. 

QoS Control and Management 

The objects in the QoS control and management planes of the middleware layer and 
the DRP govern the QoS of the stream that flows through the data transfer plane. 

Each binding object encapsulates a set of objects in the QoS control plane of the 
middleware layer and the DRP. These objects govern the QoS of the binding’s 
streaming interfaces during its lifetime. We propose that objects in the QoS control 
plane are responsible for establishing a QoS for a binding. The establishment of QoS
typically involves the negotiation of an acceptable QoS followed by the reservation
and initialization of objects in the data transfer plane to effectuate the negotiated QoS.
Other activities that can be found in the QoS control plane involve [ISOQoS, D3.1.2]
predicting a binding’s current and near-future QoS, keeping a binding’s current QoS
in line with the negotiated QoS, and releasing a binding and its resources.

The objects in the QoS management plane of the middleware layer and the DRP 
are not part of a particular binding. Rather, they can be considered part of every 
binding because their activities transcend the lifetime of individual bindings. Objects 
in the QoS management plane for instance take care of fault management and 
statistics collection. 



Example 

As an example, assume that the binding object of Fig. 2 uses an MPEG-1 encoder to 
compress the audio-video stream. Also assume that the binding relies on an RSVP 
reserved UDP transport channel to convey the encoded stream from the server to the 
client. Fig. 3 shows how the example of Fig. 2 maps onto the layers and the planes of 
our QoS framework. Observe that it zooms in on the middleware portion of Fig. 2. 
Fig. 3 does not show the application layer and the objects that it hosts. 



[Figure labels: client-side middleware layer, UDP transport provider, server-side
middleware layer; data transfer plane, QoS control plane, QoS management plane;
client machine, server machine]

Fig. 3. Two-dimensional version of the QoS framework.

The encoder and decoder are middleware level objects of the data transfer plane. The 
encoder for instance encapsulates the MPEG-1 encoder and an RTP packetizer. The 
RSVP/UDP connectivity object is part of the data transfer plane of the DRP and 
encapsulates the resources that connect the encoder to the decoder. 

The controller objects are part of the middleware layer’s QoS control plane. They
use the operational interfaces that the encoder, decoder and connection objects expose 
to control the QoS of the streams that these objects produce. The controllers may for 
instance first negotiate an encoding and a packetizer to use. The controller on the 
server-side can then invoke the operational interface of the selected encoder, for 
example to set its output bitrate parameter. After that, the controller can use the 
operational interface of the connection object to request a QoS that supports the 
encoder’s output. Observe that the notion of QoS at the connection object’s 
operational interface is of a lower level than that at the binding object’s operational 
interface. The former typically expresses its notion of QoS in terms of bandwidth, 
delay, jitter, and so on, whereas the latter uses application-oriented notions such as 
“HDTV QoS”. 
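
The difference in abstraction level between the two interfaces can be sketched as a lookup performed by the server-side controller. Everything in the sketch is hypothetical: the names BindingController, Encoder, Connection and ConnectionQoS, and especially the numeric values, which are placeholders and not figures from our testbed. Packetization overhead is ignored, so the requested bandwidth simply equals the encoder bitrate.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical controller that maps an application-level QoS name to low-level settings. */
class BindingController {
    /** QoS as seen at the connection object's operational interface. */
    static final class ConnectionQoS {
        final int bandwidthKbps;
        final int maxDelayMs;
        final int maxJitterMs;

        ConnectionQoS(int bandwidthKbps, int maxDelayMs, int maxJitterMs) {
            this.bandwidthKbps = bandwidthKbps;
            this.maxDelayMs = maxDelayMs;
            this.maxJitterMs = maxJitterMs;
        }
    }

    interface Encoder {
        void setOutputBitrateKbps(int kbps);
    }

    interface Connection {
        boolean request(ConnectionQoS qos);   // true if the reservation succeeded
    }

    private final Map<String, ConnectionQoS> levels = new HashMap<>();

    void define(String level, ConnectionQoS qos) {
        levels.put(level, qos);
    }

    /** Configures the encoder first, then asks the connection for a QoS that carries its output. */
    boolean establish(String level, Encoder encoder, Connection connection) {
        ConnectionQoS qos = levels.get(level);
        if (qos == null) {
            return false;   // unknown application-level QoS name
        }
        encoder.setOutputBitrateKbps(qos.bandwidthKbps);
        return connection.request(qos);
    }
}
```

An agent would call establish("HDTV QoS", ...) without ever seeing bandwidth, delay or jitter figures, which stay inside the controller's table.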

Observe that the objects in the distributed processing part of Fig. 2 and Fig. 3 build
on an ORB infrastructure; they can therefore invoke each other’s operational interfaces
even when they reside on different machines. For the sake of clarity, we have not
drawn these interactions. 

Also observe that Fig. 3 represents non-establishment activities of the QoS control 
plane (e.g. checking the current QoS against the negotiated QoS) as a single object. 
The activities of this object are outside the scope of this paper and we have therefore 
not shown its interactions with the other objects in Fig. 3. For similar reasons, we 
have also not shown how the object that represents the management activities (the 
meta-controller object in Fig. 3) relates to the objects in the binding. 



3. QoS at Interfaces 

In the previous section, we presented our QoS framework. In this section, we 
elaborate on the specification of QoS at interfaces and the mapping of these 
specifications to QoS support of the middleware. We define a QoS for a streaming 
interface by specifying the desired QoS at an operational interface.

We use the QoS specification language QML [Frølund] to illustrate our approach.



3.1 Control-Plane QoS Interface Model 

In our framework (see Section 2), QoS represents the quality aspects of the 
interactions between the middleware platform and one or more application objects 
that act as the endpoints of a stream. QoS realizes the quality aspects of the 
interactions between these application objects and thereby improves the usability or 
the utility of the applications that use the platform. 

The concepts that we use to specify QoS are based on the user-provider model 
[ISOQoS, D3.1.2] (see Fig. 6) in which a provider (our middleware platform) 
mediates the interactions between its users (our application objects). 

[Figure content. Example: urQoS: interactivity = “moderate”; QoS requirement_1:
“> (100 ms)^-1”; midQoS characteristic_1: (1 s)^-1 < swiftness < (10 ms)^-1]

Fig. 4. Control-plane QoS Interface Model



From a top-down perspective, the quality needs of the interactions of application 
objects drive the QoS that is required at streaming interfaces. These quality needs are 
called user requirements [ISOQoS] with respect to QoS (urQoS). In principle, urQoS 
is independent of the service provider that mediates the interaction. 

The provider may advertise its QoS capabilities in terms of the values that it
supports for a certain set of QoS characteristics (midQoS characteristics in Fig. 4).
QoS characteristics are the quantifiable aspects of QoS such as accuracy, freshness or 
reliability. The provider’s capabilities are in principle also defined independently 
from the users. Aggregations of QoS characteristics and their associated values are 
called QoS classes. A provider may support several QoS classes. Each class may be 
optimized for a certain category of services. Examples are QoS classes for real-time 
services, messaging services, or multimedia services. For reasons of flexibility, a 
provider may also offer QoS classes of different abstraction levels. QoS classes may 
for instance be network-oriented or application oriented. 

To specify a QoS at a provider’s interface, we need to translate urQoS (including 
the associated values needed) to an instance of a QoS class that the provider offers. 
To facilitate this, we use the concept of a QoS requirement [ISOQoS]. A QoS 
requirement consists of required values of a QoS characteristic and the qualifiers for 
these values (e.g. maximum, mean, or minimum). 
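
A minimal sketch can show how a QoS requirement might be checked against the range a provider advertises for a QoS characteristic. The names QoSMatch, Requirement, Capability and satisfiable are hypothetical, only the minimum and maximum qualifiers are covered (a mean qualifier would need a different test), and the advertised range is treated as an open interval, as in the swiftness example of Fig. 4.

```java
/** Hypothetical check of a QoS requirement against an advertised QoS characteristic range. */
class QoSMatch {
    enum Qualifier { MINIMUM, MAXIMUM }

    static final class Requirement {
        final String characteristic;
        final Qualifier qualifier;
        final double value;

        Requirement(String characteristic, Qualifier qualifier, double value) {
            this.characteristic = characteristic;
            this.qualifier = qualifier;
            this.value = value;
        }
    }

    /** Values the provider supports: the open interval (low, high). */
    static final class Capability {
        final String characteristic;
        final double low;
        final double high;

        Capability(String characteristic, double low, double high) {
            this.characteristic = characteristic;
            this.low = low;
            this.high = high;
        }
    }

    /** True if at least one supported value meets the requirement. */
    static boolean satisfiable(Requirement r, Capability c) {
        if (!r.characteristic.equals(c.characteristic)) {
            return false;
        }
        switch (r.qualifier) {
            case MINIMUM: return c.high > r.value;   // some supported value lies above the minimum
            case MAXIMUM: return c.low < r.value;    // some supported value lies below the maximum
            default: return false;
        }
    }
}
```

Expressed per second, the requirement "> (100 ms)^-1" is a minimum of 10 and the advertised swiftness range is (1, 100), so the requirement is satisfiable; a minimum of 100 would not be.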



3.2 User Requirement 

The urQoS of Fig. 5 typically depends on what we call the application context of use. 
We have analyzed a number of reports (see for instance [Ewos]) that describe the 
properties of multimedia data interchange in the medical domain. The reports cover 
user scenarios as well as the media characteristics and alternative stacks of protocols 
(including the compatible parameter options) that are suitable for these scenarios. The 
reports are suitable for our work because communication experts as well as medical 
specialists have contributed to them. 

Our analysis leads to the following elements as determining an application’s usage 
context: 




Middleware Support for Media Streaming Establishment 165 



• the application domain: examples are the medical domain of surgery and the game- 
hall domain; 

• the role of the user: examples are imaging specialist, surgeon, game-hall subscriber 
or guest user; 

• the purpose for which the application is used: for example, a surgeon may use a 
moving image retrieval application to check his diagnostic hypothesis or to view 
images during surgery. Latency requirements of image retrieval may for instance 
be different for both cases. 

We have furthermore identified the following urQoS dimensions: availability, 
fidelity, integrity, interactivity, and regulatory. Since these dimensions are not
orthogonal, the translation of urQoS dimensions to provider-oriented QoS
characteristics (midQoS) is not straightforward.



Table 1. User-oriented QoS dimensions

QoS dimension    Quality aspect
-------------    --------------
Availability     Present or ready for immediate use
Fidelity         Good (i.e. correct) enough in respect of the application purpose
Integrity        Delivering the whole truth in respect of the source
Interactivity    Being responsive
Regulatory       Conformant with rules, the law or established usage



Example of urQoS specification 

This example illustrates urQoS dimensions and values that are relevant for a surgeon 
who retrieves moving images from a medical imaging repository. The surgeon uses a 
Video-on-Demand (VoD) application to retrieve the images for one of the following 
purposes: 

• to validate a diagnostic hypothesis; 

• to have a closer look at the images to prepare a surgical treatment, for example 
in case the hypothesis has been confirmed; 

• as reference material during surgery. 

We may specify a template QoS category, i.e. a group of related urQoS dimensions, 
suitable for the surgeons’ VoD application by the following QML contract type: 

type Surgeon_VoD_QoScategorytype = contract {
    // VoD application contains list, browse, and retrieve methods
    availability: set {"list", "browse",
                       "diagnostic phase", "surgery phase"};
    // increasing = higher values are better
    integrity: increasing set {"lossy",                 // for listing
                               "functional lossless",   // compression < 25:1
                               "perceptual lossless"};  // compression < 12:1
    // remark: fidelity is modelled via integrity
    interactivity: increasing set {"low", "normal", "high"};
    regulatory: set {"not relevant", "CR standard",
                     "Xray standard"};
};

The QML contract that specifies the urQoS dimensions and their possible values for a
diagnostic hypothesis validation may then look as follows: 

Diagnostic_validation_QoScategory =
    Surgeon_VoD_QoScategorytype contract {
        availability == {"list", "browse", "diagnostic phase"};
        integrity == {"lossy", "functional lossless"};
        interactivity == {"low", "normal"};
        regulatory == {"CR standard", "Xray standard"};
    };

For the preparation of a surgery, we may use a more stringent value for integrity, 
"perceptual lossless" for instance. Similarly, during the surgical treatment 
we may instantiate the VoD application with the more stringent value "surgery 
phase" for availability and the value "high" for interactivity. 
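
The relation between the contract type and its instances can be sketched outside QML as follows. This is a minimal Python illustration under our own assumptions: the dictionary layout and the function names `conforms` and `most_stringent` are hypothetical, and list order is used to stand in for the "increasing" ordering.

```python
# Permissible values per urQoS dimension, copied from the contract type.
# For the "increasing" dimensions, later list entries are better.
CATEGORY_TYPE = {
    "availability": ["list", "browse", "diagnostic phase", "surgery phase"],
    "integrity": ["lossy", "functional lossless", "perceptual lossless"],
    "interactivity": ["low", "normal", "high"],
    "regulatory": ["not relevant", "CR standard", "Xray standard"],
}

DIAGNOSTIC_VALIDATION = {
    "availability": ["list", "browse", "diagnostic phase"],
    "integrity": ["lossy", "functional lossless"],
    "interactivity": ["low", "normal"],
    "regulatory": ["CR standard", "Xray standard"],
}


def conforms(contract, category_type=CATEGORY_TYPE):
    """True if the contract only uses dimensions and values that the type declares."""
    return all(
        dim in category_type and set(values) <= set(category_type[dim])
        for dim, values in contract.items()
    )


def most_stringent(dimension, values, category_type=CATEGORY_TYPE):
    """Best value of an 'increasing' dimension (not meaningful for plain sets)."""
    order = category_type[dimension]
    return max(values, key=order.index)
```

For the diagnostic validation contract, the most stringent integrity value is "functional lossless"; the surgery preparation contract would add "perceptual lossless".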



3.3 QoS Classes and Characteristics 

In the user-provider model of Fig. 5, the provider offers services as well as a set of 
associated QoS capabilities. We use QoS classes to facilitate these capability offers. 
Similar approaches can be found in ATM networks [Alles] and QoS-aware systems 
(e.g. [Lazar]). 

A QoS class defines a set of QoS characteristics that the provider supports. In this 
paper, the definition of a QoS class also encompasses the (ranges of) values of its 
QoS characteristics. 

Example: 

The following QML contract defines the dimensions of a QoS class for an RSVP- 
based transport service: 

type RSVPbased_QoSclasstype = contract {
    delay: decreasing numeric msec;
    rate: increasing numeric Mb/s;
};

An instance of the above QoS class type that supports maximum delay guarantees is: 
RSVPguaranteed_QoSclass =
    RSVPbased_QoSclasstype contract {
        delay < 100;   // bounded delay in a guaranteed service
        rate < 0.064;  // ISDN is e.g. a bottleneck link
    };

The QoS characteristics of a QoS class are generally not completely independent of
each other. A provider can furthermore define alternative sets of QoS characteristics 
and offer them to the users in the form of similar QoS classes. For example, a 
provider may offer a QoS class that uses the inverse of the delay QoS characteristic, 
i.e. the swiftness characteristic. A provider may also combine characteristics in a new 
characteristic (or perform other types of transformations). Instead of using image- 
height and image-width as characteristics, the provider may for example use image- 
size characteristic with permissible values like "CIF", "QCIF", etc. 
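
The two transformations mentioned above can be sketched as follows. The function names are hypothetical, and the pixel dimensions of CIF and QCIF are the usual H.261 values, which are not stated in this paper.

```python
# Width and height in pixels (standard H.261 picture formats; assumption,
# the paper only names the labels).
IMAGE_SIZES = {"CIF": (352, 288), "QCIF": (176, 144)}


def swiftness(delay_msec):
    """Inverse of the delay characteristic: a smaller delay gives a larger swiftness."""
    if delay_msec <= 0:
        raise ValueError("delay must be positive")
    return 1.0 / delay_msec


def image_size_label(width, height):
    """Combine image-width and image-height into one image-size value, if known."""
    for label, dims in IMAGE_SIZES.items():
        if dims == (width, height):
            return label
    return None
```

Note that the ordering is reversed by the transformation: delay is a decreasing characteristic (lower is better), whereas swiftness is increasing.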



3.4 QoS Interface Specification 

In this section, we illustrate an approach to link the user requirements in respect of
QoS to the provider-oriented QoS classes in accordance with the QoS model (Fig. 4).

QoS specification scenario at control-plane 

In the example QoS class of Section 3.3, the transport provider guarantees that delays 
will not exceed 100 ms. This means that the provider has to be able to estimate the 
communication delay between the client and the VoD server before it offers the QoS 
class. 



Fig. 5. QoS interface interaction scenario (parties shown: User, Interface, Provider)

The scenario of Fig. 5 illustrates how the user needs for QoS can be linked to a
provider’s QoS capabilities using the QoS model of Fig. 4. The user first queries the
provider for the QoS classes that it supports. The user invokes a control-plane
operational interface for this purpose. The provider then determines the set of most
appropriate QoS classes for this user. What constitutes most appropriate may for
instance depend on the delays between the participants of the communications
session. The provider can for example consult the information base in the QoS
management plane to make an estimate of this delay. After having received the QoS
classes, the user can select a suitable one and determine the values of the QoS
characteristics that best match its requirements (i.e., urQoS). The user then responds
with QoS requirements that represent the required qualifiers and values of the QoS
characteristics (see also Fig. 5).
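
The query, select and respond steps of this scenario can be sketched as below. This is an illustration under our own assumptions, not the interface of our testbed: the class and function names, the "relaxed" class and the estimated delay of 150 msec are hypothetical; only the 100 msec and 0.064 Mb/s figures come from the example of Section 3.3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QoSClass:
    name: str
    max_delay_msec: float  # delay bound the provider guarantees
    max_rate_mbps: float   # highest rate the class supports


class Provider:
    def __init__(self, classes, estimated_delay_msec):
        self._classes = classes
        # In the paper this estimate comes from the QoS management plane.
        self._estimated_delay = estimated_delay_msec

    def query_qos_classes(self):
        """Offer only classes whose delay bound can be honoured for this user."""
        return [c for c in self._classes
                if self._estimated_delay <= c.max_delay_msec]


def select_class(offered, required_delay_msec, required_rate_mbps):
    """User side: first offered class that can satisfy the urQoS-derived needs."""
    for c in offered:
        if (c.max_delay_msec <= required_delay_msec
                and c.max_rate_mbps >= required_rate_mbps):
            return c
    return None


def build_requirements(required_delay_msec, required_rate_mbps):
    """QoS requirements as (characteristic, qualifier, value) triples."""
    return [("delay", "maximum", required_delay_msec),
            ("rate", "minimum", required_rate_mbps)]


provider = Provider(
    [QoSClass("guaranteed", 100.0, 0.064), QoSClass("relaxed", 500.0, 2.0)],
    estimated_delay_msec=150.0,
)
offered = provider.query_qos_classes()
chosen = select_class(offered, required_delay_msec=500.0, required_rate_mbps=1.5)
requirements = build_requirements(500.0, 1.5)
```

With an estimated delay of 150 msec, the provider cannot offer the 100 msec class, so the user can only choose the hypothetical "relaxed" class.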

QoS requirement derivation 

In the user-provider based QoS model, the user (typically in the form of application 
objects) must translate the urQoS dimensions and values to the midQoS 
characteristics. 

An important factor in this translation is application domain knowledge. Of
particular importance is the application domain information that has been elaborated
and documented by communication experts. Examples are international standards like
the analyzed CEN/TC251 and EWOS/EG MED medical reports, e.g. [Ewos]. Other
ICT standards (e.g. CCIR.601) and results of ergonomic studies are also useful for
QoS translations and mappings.

The urQoS dimension availability typically maps to the accessibility, reliability,
swiftness and urgency midQoS characteristics. Integrity maps to accuracy, freshness,
linkage unity and also swiftness. Table 2 illustrates the translation of availability
values relevant in the medical VoD examples to accessibility. Table 3 illustrates the
translation of integrity values relevant in the medical VoD examples to accuracy and
linkage unity.
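
The translations of Tables 2 and 3 can be encoded as static lookup tables. The sketch below is one possible encoding; the dictionary layout and the function name `translate` are our own assumptions, while the values are those of the two tables.

```python
# Table 2: availability (urQoS) -> permissible accessibility values (midQoS)
AVAILABILITY_TO_ACCESSIBILITY = {
    "diagnostic phase": ["offline normal", "offline high"],
    "surgery phase": ["online normal", "online high"],
}

# Table 3: integrity (urQoS) -> accuracy and linkage unity (midQoS)
INTEGRITY_TO_MIDQOS = {
    "functional lossless": {
        "accuracy": ["CR degraded", "Xray degraded"],
        "linkage unity": "sync normal",
    },
    "perceptual lossless": {
        "accuracy": ["CR", "Xray"],
        "linkage unity": "sync high",
    },
}


def translate(ur_qos):
    """Translate the urQoS values that the tables cover into midQoS values."""
    mid_qos = {}
    if "availability" in ur_qos:
        mid_qos["accessibility"] = AVAILABILITY_TO_ACCESSIBILITY[ur_qos["availability"]]
    if "integrity" in ur_qos:
        mid_qos.update(INTEGRITY_TO_MIDQOS[ur_qos["integrity"]])
    return mid_qos


surgery = translate({"availability": "surgery phase",
                     "integrity": "perceptual lossless"})
```

Values that the tables do not cover (for example "lossy") raise a `KeyError`, which makes gaps in the static mapping visible.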



Table 2. Example availability urQoS to accessibility midQoS translation

AVAILABILITY          ACCESSIBILITY
------------          -------------
"diagnostic phase"    "offline normal" or "offline high"
"surgery phase"       "online normal" or "online high"



Table 3. Example integrity urQoS translations

INTEGRITY                ACCURACY                            LINKAGE UNITY
---------                --------                            -------------
"functional lossless"    "CR degraded" or "Xray degraded"    "sync normal"
"perceptual lossless"    "CR" or "Xray"                      "sync high"



4. QoS Characteristic Mapping 

This section briefly describes the mapping of middleware QoS characteristics (e.g. 
accessibility, accuracy or swiftness) to the QoS characteristics of the underlying DRP. 

Fig. 6 shows the QoS mapping relations when we recursively apply the model of
Fig. 4 to the interface between the middleware and the DRP.

Fig. 6. Control-plane QoS characteristic mapping and its impact on the data transfer plane

The QoS mapper of Fig. 6 has to take the influence of the objects in the
middleware’s data transfer plane into account when it determines the QoS
requirements for the DRP (α). For example, if the accuracy midQoS characteristic
value permits compression of video images, the video coding object in the data
transfer plane typically introduces additional end-to-end delay.
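
As a simple illustration of this influence, the delay that remains available for the DRP can be derived by subtracting the processing delays of the data transfer plane objects from the end-to-end bound. The function name and the delay figures used below are hypothetical.

```python
def drp_delay_budget(end_to_end_max_msec, processing_delays_msec):
    """Delay bound left for the DRP after middleware processing is accounted for."""
    budget = end_to_end_max_msec - sum(processing_delays_msec)
    if budget <= 0:
        raise ValueError("processing delay alone exceeds the end-to-end bound")
    return budget


# Assumed figures: 100 msec end-to-end bound, 30 msec encoding, 20 msec decoding.
budget = drp_delay_budget(100, [30, 20])
```

In this example the mapper would have to request a maximum delay of 50 msec from the DRP.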

A QoS mapping may also involve peer-to-peer negotiation (β). For instance, the
two controller objects of Fig. 2 may need to agree on the encoding format and the
encoding parameters to use. The DRP takes part in this negotiation in the role of the
provider. This is because the DRP for instance has to provide a connection that has
sufficient capacity to carry the output of the encoder. QoS characteristic mapping thus
generally involves a multiparty negotiation.

In some cases, we can (also) use QoS mapping tables to convert the midQoS
characteristics and associated QoS requirements to QoS characteristics that are
supported by the DRP. For example, linkage unity "sync normal" for pointing
device-to-video synchronization may map to a maximum delay difference in the range
of -580 to +820 ms at the DRP-level [Steinmetz].
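
Such a mapping table can be as small as the following sketch. Only the "sync normal" entry is given in the text; the table layout and the function name are our own assumptions.

```python
# Linkage unity value -> permissible delay difference (skew) in msec at the DRP level.
SYNC_SKEW_RANGE_MSEC = {
    "sync normal": (-580, 820),  # pointing device-to-video, per the text
}


def skew_acceptable(linkage_unity, skew_msec):
    """True if a measured delay difference stays within the mapped range."""
    low, high = SYNC_SKEW_RANGE_MSEC[linkage_unity]
    return low <= skew_msec <= high
```

For example, `skew_acceptable("sync normal", 300)` holds, whereas a skew of 900 msec falls outside the mapped range.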



5. Implementation 

We are currently building a testbed to validate our framework. We have developed a
video on demand demo based on the light profile of the CORBA Management and
Control of Audio/Video Streams RFP [AVStreams]. The demo uses the Java Media
Framework [Jmf] for streaming communications. The server and the client are
interconnected by an RSVP-enabled router. The client allows the end-user to select a
movie and conveys the user’s selection to the server. The server responds by
streaming an MPEG-1 video to the client using RTP/UDP.

Fig. 7 shows an overview of the CORBA A/V Streams RFP in terms of the objects
that we use (see Section 2.1). The objects shown in gray together represent a stream.
The str_ctrl object acts as a control point for application objects, which can be used to
request the establishment of a multimedia stream with a certain QoS. The str_ctrl
object would therefore be present in a controller object of Fig. 3.



Fig. 7. Overview of CORBA A/V Streams objects.

The a_mmdev and b_mmdev objects encapsulate multimedia devices on the client
and server machines of Fig. 3, respectively. They act as a factory for the *_vdev and
*_sep objects. The a_vdev and b_vdev objects encapsulate the resources at the
endpoint of the stream (e.g. a camera or a speaker) as well as the transport-independent
media processing resources (e.g. an MPEG-1 encoder). They for instance
engage in a peer-to-peer QoS negotiation process to determine the encoder and its
settings to use. The *_mmdev and *_vdev objects would for this reason be divided
over the application and middleware level objects of Fig. 3.

The a_sep and b_sep objects encapsulate the transport specific media processing
resources such as an RTP packetizer and an RSVP entity. These objects would
therefore be distributed over the middleware-level objects of Fig. 3. Observe that the
interface that interconnects a_sep and b_sep is the transport object [...]



6. Conclusions 

The architectural model for middleware support for QoS provisioning presented in 
this paper is a combination of an object model, a layered model, parts of the ISO QoS 
model and a model consisting of planes. The model is suitable for structuring the 
middleware internals for establishing media streams. A key feature of our model is
the identification of QoS control objects (positioned in a control plane) that are
responsible for establishing a (multi)media stream based on a QoS agreement.

Our approach is application oriented rather than system or network oriented. This 
is facilitated by the notion of a usage context and by the fact that the applications and 
our middleware-based system talk to each other in application oriented QoS terms. 

We have validated parts of our architecture. To support this claim, we have shown 
that the CORBA A/V Streams specification fits into our architecture and we have 
validated this through an initial implementation. This proves that the composition of
our architecture in terms of layers, planes, objects and interfaces is a viable one. 

We furthermore show that our notions of QoS specification can be expressed in 
QML and that the issue of QoS mapping can be realized in a heuristic manner by 
defining static mappings (e.g. by using mapping tables) between QoS specifications at 
the different layers. 
Abstract. When it comes to digital video, users expect more than simple VCR
functionality. This paper presents a system that provides both direct manipula- 
tion and conversational interaction for digital video. The paper shows the 
seamless integration of different annotation techniques (synchronous and asyn- 
chronous) combined with different presentation and interaction metaphors. Di- 
rect manipulation technique is implemented with graphical and acoustical video 
hyperlinks. These hyperlink techniques are discussed in detail with the focus on 
their temporal nature driven by the duration of the accessibility of objects within 
a digital video. An avatar system is used to give the user an additional conver- 
sational access to information that is related to the video. This access is inde- 
pendent of the playback time of the video content. These techniques are imple- 
mented. Several system components like video server, video presentation and 
video authoring are shown. The user interaction capabilities based on the tech- 
niques mentioned above are presented. 



1 Introduction 

Besides virtual reality and animation, digital videos play a major role within multimedia
applications and, therefore, in video on demand systems. The reason can be found,
on the one hand, in the nature of video as a direct view of the real world, not a simula- 
tion. On the other hand, there are large datasets of preproduced video, created over 
years of TV and movie production. A major benefit of multimedia applications lies in 
the user interaction. The interaction concepts of digital video have to be taken to the 
advanced interactivity level of animation and virtual reality. This will be a key feature 
of video on demand systems and the essential benefit of this kind of web-based system 
in relation to interactive television. 

Current research on interactive digital video is strongly influenced by web technology
and multimedia productions. Besides traditional interaction methods like VCR
functionality (change video, start, stop, fast forward, etc.) or the electronic programming
guide (EPG) of digital television, the direct manipulation interaction of the hypermedia
metaphor is transferred to video. This method of annotation is divided into
intramedia annotation (annotating content of one media channel within the same media
channel) and intermedia annotation (annotating content of one media channel
within another media channel).

Intermedia hypermedia annotation is used by Digital Renaissance [Martel96] in the
form of text hyperlinks for the synchronous annotation of the graphic and acoustic
content of digital video. Intramedia annotation of the graphic content of digital video
is done by VOSAIC [Camp95], Real [Hefta97] and ZGDV [Gerf98]. Intramedia annotation
for the audio content of digital video was proposed by [DeRoure98] and
[Braun98]. DeRoure applied hypermedia structures to audio; Braun showed an acoustic
hypermedia annotation of the audio content of digital video. An overview on the
topic of hypervideo is given in [Saw96].

These hypermedia approaches for digital video disregard the temporal structure of 
the medium. Digital video happens over time, meaning it has a start, a duration, and an 
end. The same is true for hypermedia annotations of digital video - they have a start, a 
duration, and an end. This structure has to be shown to the user, as proposed by 
[Braun99]. 

Multimedia presentations often deal with interaction facilities inside the
content, as described by [Bohm93]. These are realised with doors, touch-sensitive buttons,
etc., within virtual reality, or via interaction possibilities that are added onto the content,
like menus or navigational bars - for interaction with the content and for application
navigation. A major metaphor for these applications is conversational interaction with
human-like avatars; see [Shne95]. This conversational interaction is heavily used for
assistance systems where users are not able to do direct manipulations on objects, but
delegate their work to an agent; see [Mandel97].

In order to develop an interaction component on a digital video on demand system, 
it is of interest to combine the direct manipulation approach of hypermedia and the 
conversational interaction as done with human-like avatars. Therefore, we present a 
digital video on demand system that includes two interactive elements: 

• hypermedia annotation of the graphic and acoustic content of the digital 
video 

• conversational interaction 

The hypermedia annotation is used for synchronous annotation of the digital video’s 
content - and, therefore, for interaction on the objects within the graphics and acous- 
tics of the video. In fact, the hypermedia annotation of the video is a form of direct 
manipulation - only possible during the temporal appearance of the video objects. 

The avatar system is used for a conversational interaction on the video’s objects 
without a limitation of the object’s appearance time and not limited to the objects of 
the digital video either. Hence, the conversational avatar system can be used for syn- 
chronous and asynchronous access to information on the video content. Figure 1 
shows how an avatar is used within a conversation to show additional information to 
the user through both gesture and speech. 
Figure 1: Conversational Video Interaction



The advantage of the combination of both interaction metaphors for digital video lies 
in overcoming the video’s temporal nature: 

• time-dependent interaction on content that is currently visible within the video’s
presentation

• time-independent interaction on content regardless of the content’s playback 
time; for example, on content that was presented or is yet to be presented by 
the video 

In the following paragraphs, we introduce a hypermedia system and an avatar-based 
conversational user interface for video interaction. Then, we define the video on de- 
mand system and its user interaction component. Lastly, we provide a conclusion and 
comment on our future work. 



2 Hypermedia for Digital Video 

Hypermedia systems are widespread due to the explosive growth of the World Wide
Web. Hypermedia was developed for discrete media like text or pictures, which remain
accessible to the user as long as he does not explicitly change the content. The hypermedia
annotation is, however, accessible as long as the content is accessible. Due to their
appearance in time, continuous media have some special properties: content is presented
within a time interval; it has a start, a duration and an end. The content is, therefore,
not accessible during the entire duration of a presentation. Due to the temporal basis of video
content, most presentation systems have some kind of time index, like a time line with
a cursor in it that shows the current position within the video. This time index could
also be shown for the annotation of the hypermedia system.

The hypermedia annotation of digital video can be done in an intermedia way. The
term 'intermedia' designates the technique of having different media channels for
annotation and content. This kind of digital video interaction is done within the system
of Digital Renaissance [Martel96]. The system shows additional text hyperlinks, synchronous
to the video’s graphics and acoustics. Due to the change of media channel, it
is hard for the user to understand which object within the graphics or acoustics is the
annotated one. The name of such a textual hyperlink indicates the connection between
object and hyperlink. This technique has two disadvantages:

• The syntactical operation of an annotation is dependent on the semantics of a 
text. 

• The user has to focus on the video content and observe the textual hyperlinks 
simultaneously. 

The term 'intramedia' annotation of digital video designates the presentation and an- 
notation of video content within the same media channel. Therefore, it has to be di- 
vided into the annotation of the video graphics and the annotation of the video acous- 
tics. In both cases, the annotation can be presented explicitly or implicitly. 

• The explicit presentation of a hyperlink is shown to the user during the hy- 
perlink’s entire active time. 

• The implicit annotation shows the existence of an annotation only upon user 
demand, meaning the user has to show his interest in an object’s annotation 
to get a message indicating the existence of any additional information on the 
object. 

A typical problem of intramedia hypermedia annotation is the indication of the tempo- 
ral duration of such a continuous media hyperlink. The following information is of 
particular interest to the user in order to prevent undue stress when using this tech- 
nique: 

There should be 

• a starting point of the annotation that indicates the beginning of the hyperlink. 

• a stopping point of the annotation that indicates the end of the hyperlink. 

• a temporal progression of the annotation that allows the user to imagine how 
much time is left until the end of the hyperlink. 

In order to make interaction user-friendly, these points have to be worked out for both 
the hypermedia annotation of the video’s graphics, as well as that of the video’s 
acoustics. 
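
The three items above (start, stop and temporal progression) suggest a simple data structure for a continuous media hyperlink. The following sketch is our own illustration; the class name, the field names and the example URL are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalHyperlink:
    """Hypothetical record for a hyperlink that is only active for a time interval."""
    target: str     # e.g. a URL
    start_s: float  # starting point of the annotation, in seconds of playback time
    end_s: float    # stopping point of the annotation

    def __post_init__(self):
        if self.end_s <= self.start_s:
            raise ValueError("a hyperlink needs a positive duration")

    def is_active(self, t):
        return self.start_s <= t <= self.end_s

    def progress(self, t):
        """Elapsed fraction of the hyperlink's duration, clamped to [0, 1]."""
        fraction = (t - self.start_s) / (self.end_s - self.start_s)
        return min(1.0, max(0.0, fraction))

    def remaining(self, t):
        """Seconds left until the hyperlink ends (full duration before it starts)."""
        return max(0.0, self.end_s - max(t, self.start_s))


link = TemporalHyperlink("http://example.org/info", start_s=10.0, end_s=20.0)
```

A presentation component can then derive both the visibility of the annotation (`is_active`) and the indication of the time left (`progress`, `remaining`) from the playback position.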



Video 

An example of hypermedia annotation of the video graphics in an implicit way is 
illustrated in the system of Real [Hefta97] called the Real Player. The Real Player 
shows an annotation by a change of the mouse cursor. The system of VOSAIC 
[Camp95] shows the annotation of the video’s graphics explicitly. The annotation is 
marked as a polygon around an annotated object. 

The MovieGoer of ZGDV [Finke2000] is adjustable - the user can choose whether
to have an explicit or implicit annotation. If the annotation is implicit, all objects of
the video content are at least annotated by a default hyperlink. If the annotation is
explicit, the annotated objects are marked by a polygon. The graphical video hyperlinks
are activated by the user’s mouse click on the annotated object within the video’s
display. Figure 2 shows how additional information is presented after the user has
clicked on a video object.




Figure 2: Interactive MovieGoer System 

The DIVA System of ZGDV [Braun99] uses explicit annotation technology to show
the user the temporal length of the hyperlink. This is done via a change of the colour
of the hyperlink’s graphic. The hyperlink starts with the polygon shown in colour A
and ends with the polygon shown in colour B; during the hyperlink’s duration, the
colour of the polygon changes from the bottom to the top from colour A to colour B.
Figure 3 (left hand) shows how this technique is used. The rearmost picture is
annotated with a near white polygon, while the black colour grows dominant towards
the annotation of the fore picture. With this technique, the user can imagine right
from the start of the hyperlink how long it will take until the hyperlink ends; see
[Braun99].
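
One way to read this colour progression is as a fill level that rises through the polygon. The sketch below assumes that the polygon is rendered in horizontal rows and that colour B grows from the bottom; the function name and the row-based rendering are our own assumptions, not the DIVA implementation.

```python
def polygon_rows(n_rows, progress, colour_a="white", colour_b="black"):
    """Colours of the polygon's rows, listed from the bottom row to the top row.

    progress is the elapsed fraction of the hyperlink's duration (0.0 to 1.0).
    """
    clamped = min(1.0, max(0.0, progress))
    filled = int(clamped * n_rows)  # rows that have already changed to colour B
    return [colour_b] * filled + [colour_a] * (n_rows - filled)
```

At the start of the hyperlink all rows show colour A, halfway through the lower half shows colour B, and at the end the whole polygon shows colour B.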

Audio 

Sonic hyperlinks [Braun98] developed by the DIVA System of ZGDV are used as 
intramedia annotations of audio. A sonic hyperlink is defined as an acoustic annotation 
of acoustic information. Usually, this is some soft sound playing in parallel to the 
acoustic content of a digital video. User reaction on this hyperlink is acoustic, too, 
meaning he can react via his voice by predefined vocal commands. This way, the user 
has the opportunity to interact with the acoustic video content in a verbal way. The 
sonic hyperlink’s nature as a sound annotation makes it an explicit annotation of con- 
tent. 
Figure 3: Showing the Duration of: a graphical Hyperlink (left) and a Sonic Hyperlink (right)



The annotation of the acoustics can be done by additional sound or by the variation of
the original acoustic content of the digital video. Variations on the original sound of
the video lead to user confusion [Braun99]. Therefore, an additional sound is preferred.
To indicate the duration of the sonic hyperlink, a two-tone solution proves to
be most fitting [Braun99]. Two tones, A and B, are played at the beginning of the sonic
hyperlink; the frequency of A is then changed in small steps towards B, and the hyperlink
ends when the changed tone A meets tone B. This indicates the start, the end
and the remaining time at every position within the sonic hyperlink’s time interval.
Figure 3 (right hand) shows how the tones change throughout the duration of the
hyperlink - until they finally become the same tone.
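
The two-tone technique can be sketched as a stepwise frequency sweep. The linear step schedule, the number of steps and the example frequencies (440 Hz and 880 Hz) are assumptions for illustration; the text does not specify them.

```python
def tone_a_frequency(t, start_s, end_s, freq_a_hz, freq_b_hz, steps=20):
    """Frequency of the moving tone A at playback time t, or None outside the hyperlink.

    Tone A moves towards the fixed tone B in `steps` equal steps and
    reaches B exactly at the end of the hyperlink.
    """
    if not start_s <= t <= end_s:
        return None
    fraction = (t - start_s) / (end_s - start_s)
    step = int(fraction * steps)  # index of the current step, 0..steps
    return freq_a_hz + (freq_b_hz - freq_a_hz) * step / steps


def remaining_fraction(current_hz, freq_a_hz, freq_b_hz):
    """Share of the hyperlink's duration that is left, as the listener can infer it."""
    return abs(freq_b_hz - current_hz) / abs(freq_b_hz - freq_a_hz)
```

Halfway through a hyperlink that runs from 10 s to 20 s, tone A has covered half of the interval between the two tones, which tells the listener that half of the time is left.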

User interaction on sonic hyperlinks should work by voice in order to
use one single media channel for the content, the annotation and the user reaction. The
user’s visual channel is not affected by the interaction on the digital video’s acoustic
content. Figure 4 shows how a user interacts by using sonic hyperlinks and his voice.




Figure 4: User Interaction by Voice 
3 Avatar and Conversational Interaction



User interfaces are changing from tool-based systems using the WIMP metaphor to 
assistance systems using the conversational interaction metaphor. To simulate a hu- 
man-like conversation with an assistance system, the use of a human-like avatar with 
social and emotional competence as an ambassador of the system is a common tech- 
nique. Conversational user interaction for digital video with a human-like avatar is 
both an information source, as well as an emotional companion, for the user. As 
shown by [Bente97], the user is prepared to get in a social and emotional conversation 
while watching video content. This can be used by an avatar system that is able to get 
into a conversation about the video’s content. Figure 6 shows how the avatar can be 
utilised parallel to a video presentation to give conversational access to additional 
information. (Figure 5 shows a prototype application using our video on demand sys- 
tem to explain some research results of the ZGDV departments.) 



Figure 5: Using an Avatar for Conversational Interaction

An avatar system model for annotated digital video can be constructed of three main 
components. 

• The first component, synchronous video contribution, is an additional con- 
tent- and emotion-providing unit. It simultaneously provides output to the 
video in the form of commentaries combined with emotional behaviour. 

• The second component, asynchronous video contribution, is a conversational 
interaction unit that answers user questions based on the video’s content. 
• Feedback to user actions is the third component of the avatar system model.
The feedback is given on user interaction with the hypermedia annotation of 
the video and on user navigation onto the video. 

The three parts together determine avatar behaviour during the presentation of the 
video. 

From the user’s perspective, the avatar is either acting as a conferencier, guiding 
the user through the entire video presentation, or he is acting as a companion or col- 
league, a fellow member of the video audience and, therefore, a bystander to the video 
presentation. The following paragraphs will discuss the three components of the avatar 
system in relation to the video presentation. 



Synchronous Video Contribution 

This component is determined by the video’s presentation over time. While the video 
is active, the avatar is following a predefined dialogue. The predefined dialogue con- 
sists of an emotional part, as well as an additional contribution related to the video 
content. The dialogue is synchronous to the video and can be viewed as another form 
of video annotation in addition to the hypermedia annotation of the video. The behav- 
iour of this component is constant, meaning every time the video is played, the avatar 
will get the same dialogue content. 



Asynchronous Video Contribution 

The conversational interaction unit provides indirect access to the video content.
This presents the opportunity to get asynchronous access to additional information
linked to objects of the video content - independent of the objects currently visible in the
video. Therefore, the avatar’s behaviour is not constant through the video presentation,
but adapted to user questions.



Feedback Contribution 

Each user interaction on the digital video requires a defined feedback [Bau95]. This 
feedback can be one of the following types: 

• Assistance feedback - if the user has interacted with the video content via the 
annotations on the content. 

• Navigational feedback - based on the video player VCR functionality. 

The feedback is given via the avatar in the form of guidance through the whole digital 
video system. Therefore, the user will always have an emotional companion when he 
is interacting with the video. The feedback is based on [Perez96]. It has the following 
different states: 

• Not aware of the utterance 

• Aware of the utterance, but did not hear it 

• Heard the utterance, but did not understand it 
• Fully understood the utterance

The assistance feedback consists of immediate additional information on the video’s 
annotations and the presentation of the available additional content. Figure 6 shows 
three forms of feedback contribution: 

• The upper row shows an accepting reaction 

• The middle row shows an incomprehensible reaction 

• The lower row shows a non-accepting reaction 




Figure 6: Feedback Contribution of the Video on Demand System 



4 System Architecture 

The system architecture follows the requirements of an interactive digital video environment
that combines a direct manipulation approach based on hypermedia (see
Section 2) and a conversational interaction approach based on human-like avatars
(see Section 3). The video itself is annotated in order to provide content-based navigation
and additional information to the user. An avatar is designed as an assistant
for synchronous, asynchronous and feedback contribution to the video content.

The system consists of three modules: a content server, a presenter client and a cli- 
ent-server connection. In addition, an authoring environment is set up to provide tools 
to create content based upon our approach. The system uses the Real video server 
[Hefta97] to stream media content from server to client. In addition, the system uses 
the SMIL technology [SMIL98] to synchronise media streams. The Real server is used 
to handle the following media: 

• video and its hypermedia annotation 

• synchronous and asynchronous avatar behaviour and speech

The media synchronisation is based on SMIL.

The system client is based upon Real Media with a Java-3D rendering plug-in. 
Figure 7 shows the system architecture with its synchronous and asynchronous data 
storage on the server side, as well as the Real Media plug-ins for the presentation of 
the video on the client side. 
Figure 7: System Architecture

The authoring process is performed with tools that prepare the information to be 
served by the Real server and to be visualised by the Real client. 



Authoring 

The authoring of the system is divided into two components: the authoring of the ava- 
tar geometry and behavioural animation sequences, as well as the authoring of the 
video hypermedia annotation, the avatar behaviour triggers, and the additional content 
presented by the avatar. 

The geometry and animation sequences of the avatar are authored in VRML; 
therefore, any 3D animation tool that supports VRML can be used. The animation of 
every kind of avatar behaviour is transferred to a feature graph [Alex99]. The Java-3D 
rendering plug-in uses feature morphing to animate the avatar geometry on the basis 
of the feature graph of the particular behaviour. If a behaviour is stored as a feature 
graph in the system, it can easily be triggered by an avatar behaviour trigger. If the 
behaviour is not stored in the system, a feature graph of the behaviour can be given to 
the Java-3D rendering plug-in to animate the behaviour. 
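
The feature-graph method of [Alex99] is not described in this paper. As a simplified stand-in that conveys the general idea of morphing, the sketch below blends a neutral avatar geometry linearly towards a target expression. The function name, the vertex representation and the weight parameter are assumptions for illustration, not the plug-in's actual interface.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def morph(neutral: List[Vertex], target: List[Vertex], weight: float) -> List[Vertex]:
    """Blend two geometries with identical vertex order.

    weight 0.0 yields the neutral shape, 1.0 the target shape.
    """
    if len(neutral) != len(target):
        raise ValueError("both geometries need the same number of vertices")
    w = min(1.0, max(0.0, weight))  # clamp to the valid range
    return [
        tuple(n + w * (t - n) for n, t in zip(v_neutral, v_target))
        for v_neutral, v_target in zip(neutral, target)
    ]


# Halfway between a closed and an open mouth corner (made-up coordinates).
halfway = morph([(0.0, 0.0, 0.0)], [(2.0, 4.0, 6.0)], 0.5)
```

Animating a behaviour then amounts to stepping the weight from 0 to 1 over the duration of the behaviour.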

The authoring of the interaction facilities of the digital video is divided into a syn- 
chronous and an asynchronous part. The synchronous annotation of the digital video 
contains the hypermedia annotations (both graphic and audio annotation), as well as 
the additional behaviour and information given via the avatar. The asynchronous part 
contains the avatar’s general behaviour towards user questions, as well as a content 
database with information to answer user questions. 

• Video Hyperlinks: The annotation of the graphic content of the video is done 
in an intramedia way on the basis of a MovieGoer [Gerf98] and Real Video 
combination. The hypermedia annotation is split into the graphic annotation 
and the hyperlink target. The graphical annotation is inserted into the original 
video; the timing and spatial information, as well as the hyperlink target (for 
example, a URL), is stored separately from the video. 

• Sonic Hyperlinks: The annotation of the acoustic content of the video is done 
in an intramedia way on the basis of the Sonic Hyperlink [Braun98]. The 
acoustic annotation of the audio information is stored within the video. The 
temporal and spatial information, as well as the hyperlink target (for exam- 
ple, another part of the video), is stored separately from the video. 









• Avatar information and behaviour: The avatar’s behaviour is given via an 
avatar behaviour trigger (or feature graphs) and the text that the avatar has to 
speak. This content is stored in a text file together with temporal information 
that describes whether it is synchronous or asynchronous with respect to the video. 

• The additional content information for answering user questions is stored in a 
content database. 

The avatar’s geometry and animation are the basis for interactive videos cooperating 
with avatars. The video annotation, as well as the additional avatar text and additional 
content information, depends on the given video. 

Content Server 

The content server consists of a Real video server and a database containing videos 
and their related information: hypermedia annotation (temporal, spatial, hypermedia 
targets), additional information, and SMIL-synchronized avatar behavioural triggers 
and text. 

The information located on the server side at the beginning of a session can be di- 
vided into three functional parts: 

• The first part contains the necessary information to establish an avatar on the 
client side that is immediately ready for conversation. Therefore, this is the 
asynchronous avatar information that is responsible for the conversational 
interaction and the avatar feedback system (see paragraph 3). 

• The second part contains the video and the synchronous information, which is 
the avatar behaviour trigger and the text that the avatar is speaking on the cli- 
ent side. The video and the synchronous information are played simultane- 
ously, controlled by SMIL. 

• The third part is used to evaluate the user’s interaction. It contains the 
temporal, spatial and target information of the hypermedia annotation, as 
well as the general information for the user’s conversational interaction. 
This information is kept on the server side until it is requested by a user 
interaction, so that unnecessary data transfer between server and client is 
avoided. 
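
The evaluation of a user interaction against the stored temporal, spatial and target information can be sketched as a simple lookup. The record layout (a time interval in seconds, a rectangular screen region and a target string) and all names below are assumptions for illustration; the paper does not define the storage format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Annotation:
    start: float   # seconds into the video
    end: float
    x: int         # top-left corner of the sensitive region
    y: int
    width: int
    height: int
    target: str    # e.g. a URL or another part of the video


def resolve_click(annotations: Iterable[Annotation],
                  t: float, x: int, y: int) -> Optional[str]:
    """Return the target of the first annotation hit at time t, if any."""
    for a in annotations:
        in_time = a.start <= t <= a.end
        in_area = a.x <= x < a.x + a.width and a.y <= y < a.y + a.height
        if in_time and in_area:
            return a.target
    return None


# Hypothetical annotation: an object visible from 10 s to 20 s.
stored = [Annotation(10.0, 20.0, 100, 50, 80, 60, "http://example.invalid/info")]
hit = resolve_click(stored, t=12.5, x=120, y=70)
miss = resolve_click(stored, t=25.0, x=120, y=70)
```

Because the client only sends the time and the coordinates of the click, the annotation records themselves never have to leave the server.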




Figure 8: Server Architecture 
Figure 8 shows the server architecture from the aspect of data storage. The data 
storage is divided into synchronous and asynchronous data. The server contains 
several RealMedia plug-ins that enable the RealMedia Server to serve the data and to 
synchronize with SMIL. 

Presenter Client 

The interaction with the medium video is initiated on the client side. The client side 
provides the user with the ability to interact with objects occurring in the video in 
order to retrieve additional information. In addition, users can retrieve information by 
querying the avatar. The client structure contains six parts, see figure 9: 

• an internal media distributor that streams the different media types to the dif- 
ferent client plug-ins 

• a Real Video client 

• a decision handler, a rule system for computing the avatar behaviour 

• the Java 3D avatar renderer 

• a speech recognition system that converts user questions to requests on video 
content 

• the request unit that sends a request to the server database 

The internal media distributor receives the video, the SMIL-based synchronous 
and asynchronous avatar behavioural triggers, and the synchronous avatar text. The 
unit transfers the complete set of asynchronous avatar behavioural triggers to the 
rule system before it starts to play the video and to transfer the synchronous avatar 
behavioural triggers and text to the rule system. When the user interacts with the 
hypermedia annotation of the video, the video client notifies the rule system that an 
interaction has occurred. The temporal and spatial coordinates of the user interaction 
are transferred to the content server. 




Figure 9: Client Architecture 
The rule system calculates the actual behaviour of the avatar. The avatar’s behaviour 
upon user interaction is based on the asynchronous avatar behaviour triggers that are 
stored before the start of the video presentation. The actual behaviour of the avatar 
consists of the synchronous avatar behaviour trigger and text, and the actual user 
interaction behaviour. The calculated behaviour and the avatar’s speech are 
transferred to the avatar renderer in the form of an avatar behaviour trigger and 
text. 
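
A minimal sketch of such a rule system follows, assuming that asynchronous reactions are keyed by an event name and that synchronous cues carry a playback time. The class, its method names and the rule that scheduled cues are emitted before the reaction to a user event are assumptions for illustration; the paper does not state how the two sources are prioritised.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Behaviour:
    trigger: str  # name of a stored avatar behaviour
    text: str     # text the avatar speaks


class DecisionHandler:
    def __init__(self, reactions: Dict[str, Behaviour],
                 cues: List[Tuple[float, Behaviour]]):
        # Asynchronous reactions are loaded before playback starts.
        self._reactions = dict(reactions)
        # Synchronous cues are consumed in time order during playback.
        self._cues = sorted(cues, key=lambda cue: cue[0])
        self._next = 0

    def decide(self, now: float, user_event: Optional[str] = None) -> List[Behaviour]:
        """Return the behaviours to hand to the renderer at time `now`."""
        result = []
        while self._next < len(self._cues) and self._cues[self._next][0] <= now:
            result.append(self._cues[self._next][1])
            self._next += 1
        if user_event is not None and user_event in self._reactions:
            result.append(self._reactions[user_event])
        return result


nod = Behaviour("nod", "Here is more information.")
point = Behaviour("point", "Note the object on the left.")
handler = DecisionHandler({"accepted": nod}, [(5.0, point)])
```

Calling `handler.decide(6.0, "accepted")` would return the scheduled cue followed by the reaction to the user event.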

A speech recognition system converts the user’s speech input into requests for 
the video content and for additional video information that is stored on the server side. 
These requests are passed to the request unit, which performs a query on the 
server-side database system. 



Client-Server Connection 

The client-server connection consists of three RealMedia channels. There are two 
forward channels from the server to the client, transmitting the video and synchronous 
avatar information, as well as the additional information channel with the asynchro- 
nous avatar information and any other information that is stored in the database. The 
back channel, the connection from the client to the server, serves the temporal and 
spatial information of selected hypermedia annotations, as well as requests to the 
content database. 



User Interaction 

There are two different ways in which the user can retrieve additional information 
from the system within the video presentation on the client side. 

• One possibility is for the user to interact with the video directly: the user 
clicks with the mouse cursor on an annotated object of interest in the video 
display, or activates a Sonic Hyperlink annotated object in the video’s 
audio, and thereby requests additional information stored on the server. 
The avatar, in its role as a master of ceremonies, then presents the 
additional information to the user. This form of interaction is a direct 
manipulation of the objects of the video content. 

• The user can also interact with the video content indirectly by asking the 
avatar questions. This way, additional information can be retrieved 
throughout the whole video presentation without being limited by 
time-dependent hyperlinks. For this interaction, speech recognition is the 
interface between the user and the video content. This form of interaction 
is a conversational interaction with the objects of the video content. 
• Every user interaction, direct as well as indirect, with the video content leads 
to a feedback reaction of the avatar. In this way, the avatar guides the user 
through the video from the beginning to the end. As a guide or as a compan- 
ion within an interactive video presentation, the facial expressions, emotion 
and speech of the avatar will always be present to encourage the user to work 
with the video in order to retrieve the desired information. 

Figure 5 shows the avatar and the video on the left window side. The result of a user 
request is displayed on the right side of the window - the avatar is giving some addi- 
tional information via its voice and facial expressions. Figure 5 shows a prototype 
application that uses the system presented in this paper to explain some research re- 
sults of the ZGDV departments. 



5 Conclusion 

An integrated video on demand system for the authoring, service and presentation of 
digital interactive video is presented. The digital video’s interaction component con- 
sists of two parts: 

• a direct manipulation component realised with hypermedia continuous media 
annotation 

• a conversational interaction system based on an avatar 

The system provides the possibility of time-based (synchronous) and time-independent 
(asynchronous) interaction on video content to the user. The interaction possibilities 
are adapted to the video’s temporal nature. The user can get direct access to object 
information independent of the object’s visibility. The information may concern ob- 
jects the user actually sees or objects that were presented previously or will be played 
back later on within the presentation. 

We presented a Server System, a Client Presenter, as well as an authoring envi- 
ronment for an interactive video on demand application. 

Future work on the system is the integration of gesture input and of combined 
gesture/speech input for multimodal user interaction. The digital video interaction 
will be extended to stereo sound and stereo acoustic interaction. 
Abstract. In this paper we propose a light-weight, provably secure 
smart card integration for the OpenPGP secure message format. The 
basic idea is that the secret keys are stored on a smart card and never 
leave it. We have integrated this new security approach into an enhanced 
whiteboard, the digital lecture board (dlb). Existing whiteboards neglect 
security mechanisms almost completely, even though these mechanisms 
are extremely important to allow confidential private sessions and billing. 
The primary application field of our concept is small and closed groups, 
in which the smart card serves to attest group membership. Our first 
implementation supports the JAVA i-Button, which provides additional 
hardware security. 



1 Introduction 

Video conferencing via the Internet has become more and more important during 
the last few years, its application fields ranging from telemedicine to distribu- 
ted project meetings in global companies. Typically, a video conferencing system 
combines several media types such as audio, video and whiteboard. A whiteboard 
offers a shared workspace where slides can be presented to the conference group 
and where documents can be edited by all participants. Besides audio, the white- 
board is the most important instrument for sharing knowledge among distributed 
participants for many application fields (e.g., distance education |IKfGe98U . But 
at the same time, the features offered by existing whiteboards often are not suf- 
ficient for effective conferencing I CeEm Shortcomings mainly concerned the 
user interface, media handling, and the collaborative services needed to sup- 
port group interaction. To overcome the weaknesses of existing whiteboards, we 
developed the digital lecture board (dlb) [Geye98]. 

* A part of this research was done while the author was at the University of Mannheim. 

** Since 15. March 2000: IBM, T.J. Watson Research, USA. 

* * * Supported by the Deutsche Forschungsgemeinschaft (DFG) grant KR 1521 



H. Scholten and M. van Sinderen (Eds.): IDMS 2000, LNCS 1905, pp. 187-^^^ 2000. 
(c) Springer-Verlag Berlin Heidelberg 2000 
One important issue we had in mind while designing the dlb was to develop 
a security concept in order to protect sensitive information and to restrict access 
to sessions [GeWe98, GeWe00]. In this paper we describe further enhancements 
to the existing security mechanisms of the dlb with respect to both high security 
and user-friendliness, especially for small and closed user groups (e.g., personal 
decisions or business conferences). 

The basic idea is to use the new JAVA card technology to provide a secure 
place for the valuable secret keys that are used to encrypt/decrypt data. One 
main advantage of JAVA card technology is the ability to load different 
applications and secret keys very easily. Thus, only one card is needed for different 
sessions. 
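
The principle that a secret key stays on the card can be illustrated with a toy interface: the host hands a wrapped session key to the card and receives the unwrapped session key, but never reads the long-term secret. The class below is a hypothetical software stand-in, not the JAVA card API and not the protocol of this paper, and its HMAC-based masking is a placeholder for the card's real cryptography.

```python
import hashlib
import hmac
import os
from typing import Tuple


class ToyCard:
    """Toy stand-in for a smart card; illustrative only, not a secure design."""

    MAX_KEY_LEN = 32  # one SHA-256 output block

    def __init__(self, secret: bytes):
        # On a real card this value lives in protected hardware memory.
        self._secret = secret

    def _mask(self, nonce: bytes, length: int) -> bytes:
        return hmac.new(self._secret, nonce, hashlib.sha256).digest()[:length]

    def wrap_session_key(self, session_key: bytes) -> Tuple[bytes, bytes]:
        if len(session_key) > self.MAX_KEY_LEN:
            raise ValueError("session key too long for this toy scheme")
        nonce = os.urandom(16)
        mask = self._mask(nonce, len(session_key))
        return nonce, bytes(a ^ b for a, b in zip(session_key, mask))

    def unwrap_session_key(self, nonce: bytes, wrapped: bytes) -> bytes:
        if len(wrapped) > self.MAX_KEY_LEN:
            raise ValueError("wrapped key too long for this toy scheme")
        mask = self._mask(nonce, len(wrapped))
        return bytes(a ^ b for a, b in zip(wrapped, mask))


card = ToyCard(b"long-term group secret")
nonce, wrapped = card.wrap_session_key(b"0123456789abcdef")
recovered = card.unwrap_session_key(nonce, wrapped)
```

The design point is the narrow interface: only session keys cross the boundary, so a compromised host can misuse the card while it is attached but cannot copy the long-term key.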

In our tests we used the i-Button (also known as the JAVA Ring) by Dallas 
Semiconductor p9|, which provides very high hardware security [BWL00]. 

We also present the integration of Twofish and Rijndael, two very promising, fast 
and free candidates for the Advanced Encryption Standard (AES), the successor to 
DES (Data Encryption Standard). Additionally, we have implemented, for the 
first time, the new provably secure DES 2 X construction. 

The remainder of this paper is structured as follows: after a brief description 
of the main features of the dlb, we describe the security concept and the new 
features of the dlb, especially the novel smart card protocols. We conclude this 
paper with a summary and an outlook. 



2 The Digital Lecture Board (dlb) 

Shared Workspace. The user interface of the dlb is depicted in Figure 1. 
A document created with the dlb consists of an arbitrary number of pages, 
and each page is composed of an arbitrary number of objects such as imported 
postscript slides and images, graphical objects (e.g., circles, freehand lines, etc.) 
and text. At a certain point in time exactly one page is displayed in the so-called 
shared workspace. User actions on the shared workspace (drawing, pointing, etc.) 
are generally visible to all participants of the session. Furthermore, a private 
workspace allows the preparation of documents without interfering with the 
ongoing session. Documents created within the public or private workspace can 
be stored on disk in an SGML-like file format for later reuse. 



Fig. 1. User Interface of the dlb 



Collaboration Tools. Because the number of communication channels normally 
used in a video conferencing environment is restricted, explicit collaborative 
services are needed to organize a session and to increase awareness among 
the participants. For example, technical problems occurring during a session 
(e.g., bad audio quality due to packet loss) often result in one or more participants 
writing directly onto the displayed whiteboard page, thereby interrupting 
the regular session. Furthermore, the designation of a speaker during a lively 
discussion is awkward in a session with only one audio channel. And the video 
quality is often not good enough to recognize facial expression or to see a raised 
hand in a large audience. All these problems increase with the number of 
participants. Therefore the dlb offers a variety of collaboration tools to support 
social communication between the participants: 

— The integrated chat may be used to discuss technical issues without distur- 
bing the speaker. 

— With the help of the attention tool a participant can raise his/her hand to 
announce his/her intention to speak. 

— To create a common point of reference within the shared workspace a tele- 
pointer is available. 

— Voting allows additional feedback by polling opinions about certain session 
criteria (e.g., presentation quality). 

A more detailed discussion of collaborative services supported by the dlb can 
be found in (GejeSg and IWeCe^l . 



Communication Model. The distribution architecture of the dlb is replica- 
ted, i.e., each participant’s instance of the dlb holds a complete copy of the 
session data. To maintain a consistent shared state, user events such as drawing 
and writing are exchanged between all instances. The communication model of 
the dlb is layered. The Whiteboard Transfer Protocol (WTP) is the application 
protocol of the dlb. WTP defines packet formats and the semantics for creating 
graphical objects or pages, for telepointer data, etc. WTP packets are the payload 
of Realtime Transport Protocol (RTP) packets [SCFJ96], a protocol that 
was chosen for several reasons. The timestamps of RTP allow the synchroniza- 
tion with other RTP-compatible data streams (e.g., audio, video). Furthermore 
RTP makes it possible to use existing MBone recording systems. And RTP pro- 
vides light-weight session control through RTCP. The security concept described 
later on is compatible with the OpenPGP (OPGP) standard ( 'aea07 . OPGP 
is an open internet standard that is compatible with the de facto standard Pretty 
Good Privacy (PGP). The OPGP layer realizes the encryption/decryption of 
the transmitted data, i.e., RTP packets are wrapped into OPGP packets. We 
then use either unreliable UDP connections (e.g., for telepointer data) or relia- 
ble SMP connections to transmit the OPGP packets. Scalable Multicast Protocol 
(SMP) is a reliable transport service developed in the context of the dlb project 
[ I Gmmh7 | - 
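
The layering described above (WTP inside RTP inside OPGP, sent over UDP or SMP) can be sketched as follows. The 12-byte fixed RTP header follows the RTP specification; everything else is an assumption for illustration: the WTP payload is treated as opaque bytes, the payload type 96 is simply taken from the dynamic range, the `encrypt` callable is a placeholder for the OPGP layer, and only the telepointer case is named in the text as using UDP.

```python
import struct
from typing import Callable

RTP_VERSION = 2


def rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96) -> bytes:
    """Prefix a payload with the fixed 12-byte RTP header."""
    first = RTP_VERSION << 6      # no padding, no extension, no CSRC entries
    second = payload_type & 0x7F  # marker bit cleared
    header = struct.pack("!BBHII", first, second,
                         seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF,
                         ssrc & 0xFFFFFFFF)
    return header + payload


def opgp_wrap(rtp: bytes, encrypt: Callable[[bytes], bytes]) -> bytes:
    """Placeholder for the OPGP layer: the whole RTP packet is encrypted."""
    return encrypt(rtp)


def choose_transport(kind: str) -> str:
    """Loss-tolerant data may use UDP; everything else uses reliable SMP."""
    return "UDP" if kind == "telepointer" else "SMP"


wtp = b"hypothetical WTP bytes"
pkt = rtp_packet(wtp, seq=1, timestamp=1000, ssrc=0x1234)
wire = opgp_wrap(pkt, encrypt=lambda data: data)  # identity stands in for a cipher
```

Keeping the RTP header inside the encrypted envelope means that timestamps and sequence numbers are protected together with the whiteboard data.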



Secure Communication. The well-known DES encryption algorithm, which 
was originally designed for confidential but unclassified data, is used in many 
applications today (e.g., electronic banking). The MBone whiteboard wb also 
relies on DES for encryption. However, DES can no longer be considered secure: 
the non-profit organization Electronic Frontier Foundation (EFF), for example, 
has built a hardware DES cracker [EFF98]. In recent years, novel algorithms that 
perform better while being similar to the DES scheme have been developed [Weis98]. 
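
To give a feeling for why a 56-bit key is within reach of dedicated hardware, the following sketch computes the size of a key space and the time needed to search it completely at a given rate. The rate used in the example is an illustrative assumption and is not taken from this paper.

```python
SECONDS_PER_DAY = 86_400


def keyspace(bits: int) -> int:
    """Number of distinct keys of the given length."""
    return 2 ** bits


def exhaustive_search_days(bits: int, keys_per_second: float) -> float:
    """Days needed to try every key once at a constant rate."""
    return keyspace(bits) / keys_per_second / SECONDS_PER_DAY


# Assumed search rate, for illustration only.
ASSUMED_RATE = 9e10

des_days = exhaustive_search_days(56, ASSUMED_RATE)
ratio_128_to_56 = keyspace(128) // keyspace(56)
```

At the assumed rate a full search of the DES key space takes on the order of days, whereas each additional key bit doubles the effort, so a 128-bit key space is larger by a factor of 2^72.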

Due to the US export restrictions, export versions of many software products 
have the DES encryption disabled. Thus, outside the US, the DES encryption 
feature of wb has been unavailable for a long time. Moreover, the source code of 
wb is not publicly available, which prevents the evaluation or modification of the 
cryptographic implementation. These security limitations of the MBone whiteboard 
wb have stimulated the integration of modern encryption algorithms into 
the digital lecture board in order to provide secure conferencing with a powerful, 
collaborative whiteboard developed outside the US. 



User-Oriented Cryptography. The dlb employs a flexible user-oriented 
security concept that can be adapted to different user requirements. Users may 
choose from predefined security profiles or customize their own security 
requirements. The choice may be driven, for instance, by legal issues, costs, required 
level of security, and performance. We identify the following main profiles or user 
groups: public research, financial services, and innovative companies. 
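
The profile concept can be pictured as a small lookup with an optional user override. The cipher assigned to each profile follows the description of the three user groups in this section; the enum, the function name and the override parameter are assumptions for illustration and do not reflect the dlb's actual configuration interface.

```python
from enum import Enum
from typing import Optional


class Profile(Enum):
    PUBLIC_RESEARCH = "public research"
    FINANCIAL_SERVICES = "financial services"
    INNOVATIVE_COMPANIES = "innovative companies"


# Default cipher per profile, as described for each user group.
DEFAULT_CIPHER = {
    Profile.PUBLIC_RESEARCH: "IDEA",
    Profile.FINANCIAL_SERVICES: "Triple-DES",
    Profile.INNOVATIVE_COMPANIES: "CAST",
}


def select_cipher(profile: Profile, override: Optional[str] = None) -> str:
    """Return the profile default unless the user customised the choice."""
    return override if override is not None else DEFAULT_CIPHER[profile]


chosen = select_cipher(Profile.FINANCIAL_SERVICES)
customised = select_cipher(Profile.PUBLIC_RESEARCH, override="Twofish")
```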

Since users who work in public research can often use patented algorithms 
license-free, we rely on the IDEA cipher for this profile. IDEA was designed 
by Xuejia Lai and James Massey. The algorithm has a strong mathematical 
foundation and possesses good resistance against differential cryptanalysis. 
Many cryptographers think that IDEA is the strongest public algorithm 
[Schn96]. IDEA was the preferred cipher in PGP (Pretty Good Privacy) until 
version 2.63. However, commercial users have to pay high license fees. 

In the financial services business we find a strong preference for DES-based 
systems. Since Single DES has been cracked by brute force attacks, we suggest 
using Triple-DES in this application field. 

For innovative companies which are not afraid of new algorithms, we use 
the novel, license-free algorithm CAST. CAST has been freely available since 
January 1997 and has been the preferred cipher in PGP since version 5. The Canadian 



